Exploratory analysis and linear regression analysis of abnormal price changes and surprise percentages over different day windows
In [38]:
import pandas as pd
In [39]:
df_price_changes = pd.read_csv('Adjusted return Drifts.csv')
In [40]:
df_price_changes
Out[40]:
| | APPL - 3 Day Drift Change - abnormal | APPL - 5 Day Drift Change - abnormal | APPL - 10 Day Drift Change - abnormal | surprise | surprisePercentage |
|---|---|---|---|---|---|
| 0 | -2.53% | -5.17% | -2.50% | 0.0550 | 73.3333 |
| 1 | 3.54% | 2.93% | 2.47% | 0.0325 | 37.1429 |
| 2 | -0.85% | -0.27% | -1.73% | 0.0150 | 13.6364 |
| 3 | -1.22% | -1.90% | -3.30% | 0.0200 | 13.7931 |
| 4 | -3.69% | 0.03% | -0.18% | 0.0375 | 19.4805 |
| ... | ... | ... | ... | ... | ... |
| 58 | -3.75% | -2.47% | -1.35% | 0.0600 | 4.4776 |
| 59 | -0.70% | -2.21% | -1.35% | 0.0200 | 2.1053 |
| 60 | -1.31% | -1.89% | 1.22% | 0.0600 | 2.5641 |
| 61 | -1.93% | -3.43% | -0.94% | 0.0300 | 1.8519 |
| 62 | -0.71% | 7.09% | 11.46% | 0.1400 | 9.7902 |
63 rows × 5 columns
In [41]:
# Strip the '%' sign from each drift column and convert to float
pct_cols = [
    'APPL - 3 Day Drift Change - abnormal',
    'APPL - 5 Day Drift Change - abnormal',
    'APPL - 10 Day Drift Change - abnormal',
]
for col in pct_cols:
    df_price_changes[col] = (
        df_price_changes[col]
        .astype(str)
        .str.replace('%', '', regex=False)
        .astype(float)
    )
In [42]:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
# Define X and y values
X = df_price_changes[['surprisePercentage']] # must be 2D for sklearn
y = df_price_changes['APPL - 3 Day Drift Change - abnormal']
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")
# Make predictions
y_pred = model.predict(X)
# Plot visualisation
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 3 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: 0.0104
Intercept: -0.2896
R-squared: 0.0048
In [43]:
# Assuming df_price_changes is already loaded
# Define X and y
X = df_price_changes[['surprisePercentage']] # must be 2D for sklearn
y = df_price_changes['APPL - 5 Day Drift Change - abnormal']
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")
# Make predictions
y_pred = model.predict(X)
# Plot
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 5 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: -0.0219
Intercept: 0.5670
R-squared: 0.0098
In [44]:
# Assuming df_price_changes is already loaded
# Define X and y
X = df_price_changes[['surprisePercentage']] # must be 2D for sklearn
y = df_price_changes['APPL - 10 Day Drift Change - abnormal']
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")
# Make predictions
y_pred = model.predict(X)
# Plot
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 10 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: -0.0072
Intercept: 0.6145
R-squared: 0.0005
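All three single-factor fits have R-squared below 0.01, and sklearn's `LinearRegression` does not report a significance test for the slope. `scipy.stats.linregress` returns the slope, intercept, correlation and p-value in one call, which makes it a convenient cross-check. A minimal sketch on synthetic data (a stand-in for the `surprisePercentage` and drift columns, which live in the CSV):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in: 63 observations with a weak linear relationship,
# roughly mimicking the surprise-vs-drift data above.
rng = np.random.default_rng(0)
x = rng.normal(size=63)               # e.g. surprisePercentage
y = 0.01 * x + rng.normal(size=63)    # e.g. 3-day abnormal drift

res = linregress(x, y)
print(f"Slope: {res.slope:.4f}")
print(f"p-value: {res.pvalue:.4f}")
print(f"R-squared: {res.rvalue ** 2:.4f}")
```

With the real columns, `linregress(df_price_changes['surprisePercentage'], df_price_changes['APPL - 3 Day Drift Change - abnormal'])` would give the same slope as the sklearn fit plus the p-value needed to judge whether these near-zero R-squared values are distinguishable from noise.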
Multiple linear regressions across separate day windows - Exploratory models using manually cleaned spreadsheet data
In [91]:
df_price_changes_multilinear = pd.read_csv('Abnormal Returns - Multi Linear regressions.csv')
In [92]:
# Strip the '%' sign from each percentage column and convert to float
pct_cols = [
    'APPL - 3 Day Drift Change - abnormal',
    'APPL - 5 Day Drift Change - abnormal',
    'APPL - 10 Day Drift Change - abnormal',
    'APPL - 3 Day before announcement change - abnormal',
    'APPL - 10 Day before announcement change - abnormal',
    'APPL - 20 Day Drift Change - abnormal',
]
for col in pct_cols:
    df_price_changes_multilinear[col] = (
        df_price_changes_multilinear[col]
        .astype(str)
        .str.replace('%', '', regex=False)
        .astype(float)
    )
In [93]:
df_price_changes_multilinear
Out[93]:
| | Date | First day of month | Adj Close | Close | High | Low | Open | Volume | APPL - daily change | APPL - daily change - abnormal | ... | CPI | fiscalDateEnding | reportedDate | reportedEPS | estimatedEPS | surprise | surprisePercentage | reportTime | symbol | totalRevenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25/01/2010 | 01/01/2010 | 6.096186 | 7.252500 | 7.310714 | 7.149643 | 7.232500 | 1065699600 | 2.69% | 2.23% | ... | 216.687 | 31/12/2009 | 25/01/2010 | 0.130 | 0.0750 | 0.0550 | 73.3333 | post-market | AAPL | 1.568300e+10 |
| 1 | 20/04/2010 | 01/04/2010 | 7.342619 | 8.735357 | 8.901786 | 8.677143 | 8.876429 | 738326400 | -1.00% | -1.81% | ... | 218.009 | 31/03/2010 | 20/04/2010 | 0.120 | 0.0875 | 0.0325 | 37.1429 | post-market | AAPL | 1.349900e+10 |
| 2 | 20/07/2010 | 01/07/2010 | 7.561765 | 8.996071 | 9.032143 | 8.571786 | 8.675000 | 1074950800 | 2.57% | 1.43% | ... | 218.011 | 30/06/2010 | 20/07/2010 | 0.125 | 0.1100 | 0.0150 | 13.6364 | post-market | AAPL | 1.570000e+10 |
| 3 | 18/10/2010 | 01/10/2010 | 9.546394 | 11.357143 | 11.392857 | 11.224643 | 11.373929 | 1093010800 | 1.04% | 0.31% | ... | 218.711 | 30/09/2010 | 18/10/2010 | 0.165 | 0.1450 | 0.0200 | 13.7931 | post-market | AAPL | 2.034300e+10 |
| 4 | 18/01/2011 | 01/01/2011 | 10.226353 | 12.166071 | 12.312857 | 11.642857 | 11.768571 | 1880998000 | -2.25% | -2.38% | ... | 220.223 | 31/12/2010 | 18/01/2011 | 0.230 | 0.1925 | 0.0375 | 19.4805 | post-market | AAPL | 2.674100e+10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58 | 01/08/2024 | 01/08/2024 | 217.097168 | 218.360001 | 224.479996 | 217.020004 | 224.369995 | 62501000 | -1.68% | -0.31% | ... | 314.796 | 30/06/2024 | 01/08/2024 | 1.400 | 1.3400 | 0.0600 | 4.4776 | post-market | AAPL | 8.577700e+10 |
| 59 | 31/10/2024 | 01/10/2024 | 224.863480 | 225.910004 | 229.830002 | 225.369995 | 229.339996 | 64370100 | -1.82% | 0.04% | ... | 315.664 | 30/09/2024 | 31/10/2024 | 0.970 | 0.9500 | 0.0200 | 2.1053 | post-market | AAPL | 9.493000e+10 |
| 60 | 30/01/2025 | 01/01/2025 | 236.749542 | 237.589996 | 240.789993 | 237.210007 | 238.669998 | 55658300 | -0.74% | -1.27% | ... | 317.671 | 31/12/2024 | 30/01/2025 | 2.400 | 2.3400 | 0.0600 | 2.5641 | post-market | AAPL | 1.240000e+11 |
| 61 | 01/05/2025 | 01/05/2025 | 212.799133 | 213.320007 | 214.559998 | 208.899994 | 209.080002 | 57365700 | 0.39% | -0.24% | ... | 321.465 | 31/03/2025 | 01/05/2025 | 1.650 | 1.6200 | 0.0300 | 1.8519 | post-market | AAPL | 9.535900e+10 |
| 62 | 31/07/2025 | 01/07/2025 | 207.334701 | 207.570007 | 209.839996 | 207.160004 | 208.490005 | 80698400 | -0.71% | -0.34% | ... | 323.048 | 30/06/2025 | 31/07/2025 | 1.570 | 1.4300 | 0.1400 | 9.7902 | post-market | AAPL | 9.403600e+10 |
63 rows × 39 columns
In [94]:
pip install pandas statsmodels
Requirement already satisfied: pandas (2.2.3), statsmodels (0.14.4) and their dependencies. Note: you may need to restart the kernel to use updated packages.
In [96]:
import statsmodels.api as sm

# Define the independent variables (same for all four models)
X = df_price_changes_multilinear[['surprisePercentage',
'Vix - Close - Pre Day',
'Fed Funds Rate',
'APPL - 10 Day before announcement change - abnormal']]
# Add a constant (intercept)
X = sm.add_constant(X)
# ----------------------------------------------------------
# Model 1 — 3-Day Drift
# ----------------------------------------------------------
y1 = df_price_changes_multilinear['APPL - 3 Day Drift Change - abnormal']
model1 = sm.OLS(y1, X).fit()
print("=== Model 1: 3-Day Drift ===")
print(model1.summary())
print("\n")
# ----------------------------------------------------------
# Model 2 — 5-Day Drift
# ----------------------------------------------------------
y2 = df_price_changes_multilinear['APPL - 5 Day Drift Change - abnormal']
model2 = sm.OLS(y2, X).fit()
print("=== Model 2: 5-Day Drift ===")
print(model2.summary())
print("\n")
# ----------------------------------------------------------
# Model 3 — 10-Day Drift
# ----------------------------------------------------------
y3 = df_price_changes_multilinear['APPL - 10 Day Drift Change - abnormal']
model3 = sm.OLS(y3, X).fit()
print("=== Model 3: 10-Day Drift ===")
print(model3.summary())
# ----------------------------------------------------------
# Model 4 — 20-Day Drift
# ----------------------------------------------------------
y4 = df_price_changes_multilinear['APPL - 20 Day Drift Change - abnormal']
model4 = sm.OLS(y4, X).fit()
print("=== Model 4: 20-Day Drift ===")
print(model4.summary())
=== Model 1: 3-Day Drift ===
OLS Regression Results
================================================================================================
Dep. Variable: APPL - 3 Day Drift Change - abnormal R-squared: 0.113
Model: OLS Adj. R-squared: 0.052
Method: Least Squares F-statistic: 1.843
Date: Fri, 31 Oct 2025 Prob (F-statistic): 0.133
Time: 15:58:24 Log-Likelihood: -128.73
No. Observations: 63 AIC: 267.5
Df Residuals: 58 BIC: 278.2
Df Model: 4
Covariance Type: nonrobust
=======================================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const 1.4165 0.808 1.753 0.085 -0.201 3.034
surprisePercentage 0.0105 0.019 0.538 0.593 -0.028 0.049
Vix - Close - Pre Day -0.0802 0.040 -2.019 0.048 -0.160 -0.001
Fed Funds Rate -0.1535 0.142 -1.080 0.285 -0.438 0.131
APPL - 10 Day before announcement change - abnormal -0.0757 0.070 -1.086 0.282 -0.215 0.064
==============================================================================
Omnibus: 0.527 Durbin-Watson: 2.084
Prob(Omnibus): 0.768 Jarque-Bera (JB): 0.673
Skew: 0.137 Prob(JB): 0.714
Kurtosis: 2.574 Cond. No. 74.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Model 2: 5-Day Drift ===
OLS Regression Results
================================================================================================
Dep. Variable: APPL - 5 Day Drift Change - abnormal R-squared: 0.084
Model: OLS Adj. R-squared: 0.021
Method: Least Squares F-statistic: 1.329
Date: Fri, 31 Oct 2025 Prob (F-statistic): 0.270
Time: 15:58:24 Log-Likelihood: -154.51
No. Observations: 63 AIC: 319.0
Df Residuals: 58 BIC: 329.7
Df Model: 4
Covariance Type: nonrobust
=======================================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const 2.5718 1.217 2.114 0.039 0.136 5.007
surprisePercentage -0.0289 0.029 -0.987 0.328 -0.088 0.030
Vix - Close - Pre Day -0.0799 0.060 -1.337 0.186 -0.200 0.040
Fed Funds Rate -0.3304 0.214 -1.544 0.128 -0.759 0.098
APPL - 10 Day before announcement change - abnormal -0.0651 0.105 -0.620 0.538 -0.275 0.145
==============================================================================
Omnibus: 0.663 Durbin-Watson: 2.115
Prob(Omnibus): 0.718 Jarque-Bera (JB): 0.279
Skew: 0.147 Prob(JB): 0.870
Kurtosis: 3.140 Cond. No. 74.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Model 3: 10-Day Drift ===
OLS Regression Results
=================================================================================================
Dep. Variable: APPL - 10 Day Drift Change - abnormal R-squared: 0.022
Model: OLS Adj. R-squared: -0.045
Method: Least Squares F-statistic: 0.3285
Date: Fri, 31 Oct 2025 Prob (F-statistic): 0.858
Time: 15:58:24 Log-Likelihood: -179.62
No. Observations: 63 AIC: 369.2
Df Residuals: 58 BIC: 379.9
Df Model: 4
Covariance Type: nonrobust
=======================================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const 1.9140 1.813 1.056 0.295 -1.714 5.542
surprisePercentage -0.0067 0.044 -0.153 0.879 -0.094 0.081
Vix - Close - Pre Day -0.0602 0.089 -0.676 0.502 -0.239 0.118
Fed Funds Rate -0.1161 0.319 -0.364 0.717 -0.754 0.522
APPL - 10 Day before announcement change - abnormal -0.1146 0.156 -0.733 0.467 -0.428 0.198
==============================================================================
Omnibus: 0.105 Durbin-Watson: 2.047
Prob(Omnibus): 0.949 Jarque-Bera (JB): 0.061
Skew: 0.064 Prob(JB): 0.970
Kurtosis: 2.918 Cond. No. 74.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Model 4: 20-Day Drift ===
OLS Regression Results
=================================================================================================
Dep. Variable: APPL - 20 Day Drift Change - abnormal R-squared: 0.138
Model: OLS Adj. R-squared: 0.079
Method: Least Squares F-statistic: 2.327
Date: Fri, 31 Oct 2025 Prob (F-statistic): 0.0668
Time: 15:58:24 Log-Likelihood: -198.11
No. Observations: 63 AIC: 406.2
Df Residuals: 58 BIC: 416.9
Df Model: 4
Covariance Type: nonrobust
=======================================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const 5.8825 2.431 2.420 0.019 1.017 10.748
surprisePercentage 0.0164 0.059 0.281 0.780 -0.101 0.134
Vix - Close - Pre Day -0.2501 0.119 -2.094 0.041 -0.489 -0.011
Fed Funds Rate -0.2891 0.428 -0.676 0.502 -1.145 0.567
APPL - 10 Day before announcement change - abnormal -0.3757 0.210 -1.791 0.078 -0.795 0.044
==============================================================================
Omnibus: 0.257 Durbin-Watson: 2.402
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.069
Skew: -0.080 Prob(JB): 0.966
Kurtosis: 3.019 Cond. No. 74.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exploratory analysis of price changes, earnings surprises and 3-day drift change
In [3]:
import pandas as pd
In [3]:
dataset = pd.read_excel('Copy of Apple_Master_Sheet_1.xlsx')
In [4]:
dataset
Out[4]:
| | Unnamed: 0 | Date | First day of month | Adj Close | Close | High | Low | Open | Volume | APPL - daily change | ... | CPI | fiscalDateEnding | reportedDate | reportedEPS | estimatedEPS | surprise | surprisePercentage | reportTime | symbol | totalRevenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2010-01-04 | 2010-01-01 | 6.424606 | 7.643214 | 7.660714 | 7.585000 | 7.622500 | 493729600 | 0.000000 | ... | 216.687 | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2010-01-05 | 2010-01-01 | 6.435713 | 7.656429 | 7.699643 | 7.616071 | 7.664286 | 601904800 | 0.001729 | ... | 216.687 | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 2010-01-06 | 2010-01-01 | 6.333344 | 7.534643 | 7.686786 | 7.526786 | 7.656429 | 552160000 | -0.015906 | ... | 216.687 | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3 | 2010-01-07 | 2010-01-01 | 6.321636 | 7.520714 | 7.571429 | 7.466071 | 7.562500 | 477131200 | -0.001849 | ... | 216.687 | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | 2010-01-08 | 2010-01-01 | 6.363664 | 7.570714 | 7.571429 | 7.466429 | 7.510714 | 447610800 | 0.006648 | ... | 216.687 | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3969 | 3969 | 2025-10-14 | 2025-10-01 | 247.770004 | 247.770004 | 248.850006 | 244.699997 | 246.600006 | 35478000 | 0.000444 | ... | NaN | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3970 | 3970 | 2025-10-15 | 2025-10-01 | 249.339996 | 249.339996 | 251.820007 | 247.470001 | 249.490005 | 33893600 | 0.006336 | ... | NaN | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3971 | 3971 | 2025-10-16 | 2025-10-01 | 247.449997 | 247.449997 | 249.039993 | 245.130005 | 248.250000 | 39777000 | -0.007580 | ... | NaN | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3972 | 3972 | 2025-10-17 | 2025-10-01 | 252.289993 | 252.289993 | 253.380005 | 247.270004 | 248.020004 | 49147000 | 0.019559 | ... | NaN | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3973 | 3973 | 2025-10-20 | 2025-10-01 | 262.239990 | 262.239990 | 264.380005 | 255.630005 | 255.889999 | 90370300 | 0.039439 | ... | NaN | NaT | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3974 rows × 45 columns
In [9]:
import matplotlib.pyplot as plt

# Make sure Date is datetime
dataset['Date'] = pd.to_datetime(dataset['Date'])
fig, ax1 = plt.subplots(figsize=(10, 6))
# Left y-axis: Adj Close
ax1.plot(dataset['Date'], dataset['Adj Close'], color='blue', label='Adj Close')
ax1.set_xlabel('Date')
ax1.set_ylabel('Adj Close', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
# Right y-axis: only non-NaN surprisePercentage values
ax2 = ax1.twinx()
mask = dataset['surprisePercentage'].notna()
ax2.plot(dataset.loc[mask, 'Date'],
dataset.loc[mask, 'surprisePercentage'],
color='red', marker='o', linestyle='-', linewidth=2, label='Surprise %')
ax2.set_ylabel('Surprise %', color='red')
ax2.tick_params(axis='y', labelcolor='red')
plt.title('Adj Close vs Surprise Percentage (Only Available Dates)')
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left')
plt.tight_layout()
plt.show()
In [13]:
mask = dataset['surprisePercentage'].notna() & dataset['APPL - 3 Day Drift Change - abnormal'].notna()
compare_df = dataset.loc[mask, ['Date', 'surprisePercentage', 'APPL - 3 Day Drift Change - abnormal']]
fig, ax1 = plt.subplots(figsize=(10,6))
ax1.plot(compare_df['Date'], compare_df['surprisePercentage'], color='blue', marker='o', label='Surprise %')
ax1.set_xlabel('Date')
ax1.set_ylabel('Surprise %', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax2 = ax1.twinx()
ax2.plot(compare_df['Date'], compare_df['APPL - 3 Day Drift Change - abnormal'], color='red', marker='x', label='3-Day Drift Change (Abnormal)')
ax2.set_ylabel('3-Day Drift Change (Abnormal)', color='red')
ax2.tick_params(axis='y', labelcolor='red')
plt.title('Surprise % vs 3-Day Drift Change (Abnormal)')
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left')
plt.tight_layout()
plt.show()
Histograms for earnings surprise percentages
In [4]:
appl_earnings = pd.read_csv('appl_earnings.csv')
googl_earnings = pd.read_csv('googl_earnings.csv')
nvda_earnings = pd.read_csv('nvda_earnings.csv')
In [5]:
import matplotlib.pyplot as plt
# Make the figure larger
plt.figure(figsize=(12, 4))
# Apple
plt.subplot(1, 3, 1)
plt.hist(appl_earnings['surprisePercentage'], bins=15, color='skyblue', edgecolor='black')
plt.title('AAPL: Surprise %')
plt.xlabel('Surprise Percentage')
plt.ylabel('Frequency')
# Google
plt.subplot(1, 3, 2)
plt.hist(googl_earnings['surprisePercentage'], bins=15, color='lightgreen', edgecolor='black')
plt.title('GOOGL: Surprise %')
plt.xlabel('Surprise Percentage')
# Nvidia
plt.subplot(1, 3, 3)
plt.hist(nvda_earnings['surprisePercentage'], bins=15, color='salmon', edgecolor='black')
plt.title('NVDA: Surprise %')
plt.xlabel('Surprise Percentage')
plt.tight_layout()
plt.show()
In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import gaussian_kde
# Clean numeric data
appl = pd.to_numeric(appl_earnings['surprisePercentage'], errors='coerce').dropna()
googl = pd.to_numeric(googl_earnings['surprisePercentage'], errors='coerce').dropna()
nvda = pd.to_numeric(nvda_earnings['surprisePercentage'], errors='coerce').dropna()
# Combine all data for shared bins
all_data = np.concatenate([appl, googl, nvda])
# More bins (e.g., 30)
if all_data.min() == all_data.max():
bins = 10
else:
bins = np.linspace(all_data.min(), all_data.max(), 31) # 30 bins
plt.figure(figsize=(12, 7))
# Histograms (shared bins)
plt.hist(appl, bins=bins, alpha=0.4, label='AAPL', edgecolor='black')
plt.hist(googl, bins=bins, alpha=0.4, label='GOOGL', edgecolor='black')
plt.hist(nvda, bins=bins, alpha=0.4, label='NVDA', edgecolor='black')
# KDE Trend Lines
xs = np.linspace(all_data.min(), all_data.max(), 400)
appl_kde = gaussian_kde(appl)
googl_kde = gaussian_kde(googl)
nvda_kde = gaussian_kde(nvda)
plt.plot(xs, appl_kde(xs) * len(appl) * (bins[1] - bins[0]), label='AAPL Trend')
plt.plot(xs, googl_kde(xs) * len(googl) * (bins[1] - bins[0]), label='GOOGL Trend')
plt.plot(xs, nvda_kde(xs) * len(nvda) * (bins[1] - bins[0]), label='NVDA Trend')
# Title & labels
plt.title('Surprise % Distribution with KDE Trend Lines')
plt.xlabel('Surprise Percentage')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
Histograms for CAR values - Separate Windows
In [20]:
# EVENT STUDY - HISTOGRAMS BY TICKER AND WINDOW
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load the Excel file
file_path = "event_study (1).xlsx"
# Specify only the CAR sheets you need
sheets_to_load = [
'CAR_(0,1)',
'CAR_(0,3)',
'CAR_(0,5)',
'CAR_(-1,+1)',
'CAR_(-1,+5)_ROBUST'
]
# Read those sheets into a dictionary of DataFrames
car_sheets = pd.read_excel(file_path, sheet_name=sheets_to_load)
print("Sheets loaded:", list(car_sheets.keys()))
# 2. Group by ticker and view summary statistics
summary_stats = {}
for sheet_name, df in car_sheets.items():
# Make sure expected columns exist
if 'ticker' not in df.columns or 'CAR' not in df.columns:
print(f"⚠️ Skipping {sheet_name} (missing columns)")
continue
grouped = df.groupby('ticker')['CAR'].describe()
summary_stats[sheet_name] = grouped
print(f"\n=== {sheet_name} ===")
print(grouped)
# 3. Plot histograms by ticker (separate per CAR window)
for sheet_name, df in car_sheets.items():
if 'ticker' not in df.columns or 'CAR' not in df.columns:
continue
print(f"\n📊 Plotting {sheet_name}...")
tickers = df['ticker'].unique()
# One histogram per ticker
for tkr in tickers:
subset = df[df['ticker'] == tkr]
plt.figure(figsize=(6, 4))
plt.hist(subset['CAR'], bins=15, color='skyblue', edgecolor='black')
plt.title(f"{sheet_name} — {tkr}")
plt.xlabel("CAR")
plt.ylabel("Frequency")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# 4. Combined histogram per CAR window (tickers overlaid)
for sheet_name, df in car_sheets.items():
if 'ticker' not in df.columns or 'CAR' not in df.columns:
continue
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='CAR', hue='ticker', bins=20, kde=True, element='step')
plt.title(f"Distribution of CAR by Ticker — {sheet_name}")
plt.xlabel("CAR")
plt.ylabel("Frequency")
plt.legend(title='Ticker')
plt.tight_layout()
plt.show()
# 5. (Optional) Save grouped summaries to Excel
with pd.ExcelWriter("CAR_ticker_summaries.xlsx") as writer:
for sheet_name, grouped in summary_stats.items():
grouped.to_excel(writer, sheet_name=sheet_name)
print("\n✅ Analysis complete — histograms displayed and summary file saved as 'CAR_ticker_summaries.xlsx'")
Sheets loaded: ['CAR_(0,1)', 'CAR_(0,3)', 'CAR_(0,5)', 'CAR_(-1,+1)', 'CAR_(-1,+5)_ROBUST']
=== CAR_(0,1) ===
count mean std min 25% 50% 75% \
ticker
AAPL 43.0 0.005428 0.046335 -0.089698 -0.024862 0.011972 0.035717
GOOGL 43.0 0.003760 0.052160 -0.093655 -0.034078 0.005565 0.038648
NVDA 43.0 0.026321 0.096231 -0.262444 -0.027790 0.000834 0.087351
max
ticker
AAPL 0.103840
GOOGL 0.142875
NVDA 0.242677
=== CAR_(0,3) ===
count mean std min 25% 50% 75% \
ticker
AAPL 43.0 0.006503 0.052120 -0.094895 -0.024847 0.018516 0.042477
GOOGL 43.0 0.000413 0.058373 -0.106452 -0.044546 0.001775 0.038365
NVDA 43.0 0.028975 0.107400 -0.232482 -0.047788 0.011799 0.109112
max
ticker
AAPL 0.107871
GOOGL 0.161237
NVDA 0.315025
=== CAR_(0,5) ===
count mean std min 25% 50% 75% \
ticker
AAPL 43.0 0.009665 0.056591 -0.102176 -0.031273 0.018055 0.049417
GOOGL 43.0 -0.002316 0.057561 -0.122602 -0.043850 -0.007569 0.038725
NVDA 43.0 0.024691 0.106726 -0.194454 -0.057976 0.001781 0.102786
max
ticker
AAPL 0.119699
GOOGL 0.118061
NVDA 0.321600
=== CAR_(-1,+1) ===
count mean std min 25% 50% 75% \
ticker
AAPL 43.0 0.007897 0.044772 -0.085606 -0.018616 0.015868 0.036014
GOOGL 43.0 0.007233 0.051769 -0.097168 -0.030333 0.004781 0.039020
NVDA 43.0 0.023750 0.094759 -0.260275 -0.030462 0.013586 0.077743
max
ticker
AAPL 0.113222
GOOGL 0.162359
NVDA 0.222872
=== CAR_(-1,+5)_ROBUST ===
count mean std min 25% 50% 75% \
ticker
AAPL 43.0 0.012134 0.055863 -0.093473 -0.030653 0.020592 0.055921
GOOGL 43.0 0.001158 0.055660 -0.126114 -0.037115 0.001218 0.037533
NVDA 43.0 0.022120 0.105125 -0.192285 -0.046309 0.007665 0.084850
max
ticker
AAPL 0.129082
GOOGL 0.137545
NVDA 0.294146
📊 Plotting CAR_(0,1)...
📊 Plotting CAR_(0,3)...
📊 Plotting CAR_(0,5)...
📊 Plotting CAR_(-1,+1)...
📊 Plotting CAR_(-1,+5)_ROBUST...
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument. plt.legend(title='Ticker')
(The same UserWarning is repeated for each of the five CAR sheets.)
✅ Analysis complete — histograms displayed and summary file saved as 'CAR_ticker_summaries.xlsx'
In [22]:
import matplotlib.pyplot as plt
import seaborn as sns
for sheet_name, df in car_sheets.items():
if 'ticker' not in df.columns or 'CAR' not in df.columns:
continue
# Drop rows with missing ticker or CAR values
df = df.dropna(subset=['ticker', 'CAR']).copy()
# Ensure ticker is a proper string (so Seaborn recognizes it as categorical)
df['ticker'] = df['ticker'].astype(str).str.strip()
plt.figure(figsize=(8, 5))
ax = sns.histplot(
data=df,
x='CAR',
hue='ticker',
bins=20,
kde=True,
element='step',
alpha=0.5
)
# Force the legend to show actual labels
handles, labels = ax.get_legend_handles_labels()
# If seaborn doesn’t pick up the labels, rebuild them manually
if not labels or labels == ['Ticker']:
unique_tickers = sorted(df['ticker'].unique())
handles = [plt.Line2D([0], [0], color=c, lw=4) for c in sns.color_palette(n_colors=len(unique_tickers))]
labels = unique_tickers
plt.legend(handles, labels, title='Ticker', title_fontsize=11, fontsize=10, loc='upper right')
plt.title(f"Distribution of CAR by Ticker — {sheet_name}")
plt.xlabel("CAR")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
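The histograms show the shape of each CAR distribution; the natural event-study follow-up is whether the mean CAR per ticker differs significantly from zero, which a one-sample t-test (`scipy.stats.ttest_1samp`) answers directly. A sketch on synthetic CAR values with moments similar to the summary tables above (the real values sit in `car_sheets`):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in for one CAR window: 43 events per ticker,
# means/stds loosely matching the CAR_(0,1) summary above.
rng = np.random.default_rng(2)
cars = {
    'AAPL': rng.normal(0.005, 0.046, 43),
    'GOOGL': rng.normal(0.004, 0.052, 43),
    'NVDA': rng.normal(0.026, 0.096, 43),
}

for ticker, values in cars.items():
    t_stat, p_val = ttest_1samp(values, popmean=0.0)
    print(f"{ticker}: mean CAR = {values.mean():.4f}, "
          f"t = {t_stat:.2f}, p = {p_val:.3f}")
```

On the real sheets this becomes `df.groupby('ticker')['CAR'].apply(lambda v: ttest_1samp(v, 0.0).pvalue)`, one line per CAR window.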
Share price changes vs the market (S&P 500)
In [23]:
SandP_vs_share_prices = pd.read_csv('SandP - Stock Changes.csv')
In [24]:
import matplotlib.pyplot as plt
# Set up the figure and 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# Scatter 1: NVDA vs S&P
axes[0].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['nvda_change'],
alpha=0.6, color='green', edgecolor='black')
axes[0].set_title('S&P 500 vs NVIDIA')
axes[0].set_xlabel('S&P_Change')
axes[0].set_ylabel('nvda_change')
axes[0].grid(alpha=0.3)
# Scatter 2: AAPL vs S&P
axes[1].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['appl_change'],
alpha=0.6, color='blue', edgecolor='black')
axes[1].set_title('S&P 500 vs Apple')
axes[1].set_xlabel('S&P_Change')
axes[1].set_ylabel('appl_change')
axes[1].grid(alpha=0.3)
# Scatter 3: GOOGL vs S&P
axes[2].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['goog_change'],
alpha=0.6, color='orange', edgecolor='black')
axes[2].set_title('S&P 500 vs Google')
axes[2].set_xlabel('S&P_Change')
axes[2].set_ylabel('goog_change')
axes[2].grid(alpha=0.3)
plt.tight_layout()
plt.show()
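Before fitting the regressions in the next cell, a correlation check confirms that each stock co-moves with the index at all. A sketch on synthetic daily changes (the real columns are `S&P_Change`, `nvda_change`, and so on); note that heavier idiosyncratic noise lowers correlation even when the beta is larger:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
market = rng.normal(0, 0.01, n)
# Each stock = beta * market + idiosyncratic noise
frame = pd.DataFrame({
    "S&P_Change": market,
    "nvda_change": 1.6 * market + rng.normal(0, 0.015, n),
    "appl_change": 1.1 * market + rng.normal(0, 0.010, n),
})

# Correlation of each stock with the index
corr = frame.corr()["S&P_Change"].drop("S&P_Change")
print(corr)
```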
In [25]:
import pandas as pd
import statsmodels.api as sm
# Define your dataframe
df = SandP_vs_share_prices.copy()
# Define dependent variables (the three stocks)
stocks = ['nvda_change', 'appl_change', 'goog_change']
# Loop through each stock and run a regression vs. S&P_Change
for stock in stocks:
print(f"\n=== Linear Regression: {stock} vs S&P_Change ===")
# Drop missing values for the two relevant columns
data = df[['S&P_Change', stock]].dropna()
# Define X (independent variable) and y (dependent)
X = sm.add_constant(data['S&P_Change']) # adds intercept (alpha)
y = data[stock]
# Run Ordinary Least Squares regression
model = sm.OLS(y, X).fit()
# Print summary
print(model.summary())
# Extract key results
alpha = model.params['const']
beta = model.params['S&P_Change']
r2 = model.rsquared
print(f"Alpha: {alpha:.4f} | Beta: {beta:.4f} | R²: {r2:.4f}")
=== Linear Regression: nvda_change vs S&P_Change ===
OLS Regression Results
==============================================================================
Dep. Variable: nvda_change R-squared: 0.399
Model: OLS Adj. R-squared: 0.399
Method: Least Squares F-statistic: 2635.
Date: Tue, 11 Nov 2025 Prob (F-statistic): 0.00
Time: 12:28:00 Log-Likelihood: 9460.4
No. Observations: 3973 AIC: -1.892e+04
Df Residuals: 3971 BIC: -1.890e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0011 0.000 3.078 0.002 0.000 0.002
S&P_Change 1.6635 0.032 51.328 0.000 1.600 1.727
==============================================================================
Omnibus: 1584.320 Durbin-Watson: 2.039
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50178.243
Skew: 1.265 Prob(JB): 0.00
Kurtosis: 20.225 Cond. No. 91.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0011 | Beta: 1.6635 | R²: 0.3988
=== Linear Regression: appl_change vs S&P_Change ===
OLS Regression Results
==============================================================================
Dep. Variable: appl_change R-squared: 0.477
Model: OLS Adj. R-squared: 0.477
Method: Least Squares F-statistic: 3621.
Date: Tue, 11 Nov 2025 Prob (F-statistic): 0.00
Time: 12:28:00 Log-Likelihood: 11649.
No. Observations: 3973 AIC: -2.329e+04
Df Residuals: 3971 BIC: -2.328e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0005 0.000 2.543 0.011 0.000 0.001
S&P_Change 1.1242 0.019 60.178 0.000 1.088 1.161
==============================================================================
Omnibus: 549.580 Durbin-Watson: 1.890
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7460.253
Skew: 0.047 Prob(JB): 0.00
Kurtosis: 9.712 Cond. No. 91.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0005 | Beta: 1.1242 | R²: 0.4770
=== Linear Regression: goog_change vs S&P_Change ===
OLS Regression Results
==============================================================================
Dep. Variable: goog_change R-squared: 0.469
Model: OLS Adj. R-squared: 0.469
Method: Least Squares F-statistic: 3504.
Date: Tue, 11 Nov 2025 Prob (F-statistic): 0.00
Time: 12:28:00 Log-Likelihood: 11717.
No. Observations: 3973 AIC: -2.343e+04
Df Residuals: 3971 BIC: -2.342e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0003 0.000 1.512 0.131 -9.04e-05 0.001
S&P_Change 1.0873 0.018 59.197 0.000 1.051 1.123
==============================================================================
Omnibus: 1485.661 Durbin-Watson: 1.937
Prob(Omnibus): 0.000 Jarque-Bera (JB): 64397.851
Skew: 1.055 Prob(JB): 0.00
Kurtosis: 22.610 Cond. No. 91.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0003 | Beta: 1.0873 | R²: 0.4688
Linear and multiple linear regressions using CAR values
In [4]:
import pandas as pd
import numpy as np
from pathlib import Path
# Use statsmodels for regression with standard errors and probability values
import statsmodels.api as statsmodels_api
In [16]:
# Step 1: list sheet names so you can target the right ones
event_book = pd.ExcelFile("event_study.xlsx")
feature_book = pd.ExcelFile("features_patched.xlsx")
print("Event sheets:")
print(event_book.sheet_names)
print("\nFeature sheets:")
print(feature_book.sheet_names)
Event sheets:
['README', 'CAR_(0,1)', 'CAR_(0,3)', 'CAR_(0,5)', 'CAR_(-1,+1)', 'CAR_(-1,+5)_ROBUST', 'CAAR_Summary', 'AlphaBeta_Params']

Feature sheets:
['Sheet1', 'features']
In [18]:
# Step 2: choose the sheets
# Event windows are every sheet that starts with "CAR"
event_window_sheets = [s for s in event_book.sheet_names if str(s).upper().startswith("CAR")]
# The features live in a sheet named "features"
features_sheet = "features"
In [20]:
# Step 3: read the features sheet once
features_table = pd.read_excel("features_patched.xlsx", sheet_name=features_sheet)
In [26]:
# Step 4: set join keys and predictor
join_keys = ["ticker", "announce_date", "timing", "day0"]
predictor_col = "gap_proxy_dm1_to_d0"
# basic checks
missing_in_features = [c for c in join_keys if c not in features_table.columns]
if missing_in_features:
raise ValueError(f"Join keys missing in features sheet: {missing_in_features}")
if predictor_col not in features_table.columns:
raise ValueError(f"Missing predictor column: {predictor_col}")
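Before relying on an inner join over four keys, `pd.merge` with `how="outer"` and `indicator=True` shows how many rows fail to match on either side. A sketch on toy frames with two of the keys (the real keys are ticker, announce_date, timing, and day0):

```python
import pandas as pd

left = pd.DataFrame({"ticker": ["AAPL", "AAPL", "NVDA"],
                     "day0": ["2024-01-02", "2024-04-02", "2024-02-21"],
                     "CAR": [0.01, -0.02, 0.05]})
right = pd.DataFrame({"ticker": ["AAPL", "NVDA", "GOOGL"],
                      "day0": ["2024-01-02", "2024-02-21", "2024-01-30"],
                      "gap_proxy_dm1_to_d0": [0.004, 0.02, -0.01]})

# _merge flags each row as 'both', 'left_only', or 'right_only'
check = pd.merge(left, right, on=["ticker", "day0"], how="outer", indicator=True)
print(check["_merge"].value_counts())
```

Rows flagged `left_only` or `right_only` reveal key mismatches (for example, date formats that differ between workbooks) before the inner join silently drops them.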
In [28]:
# Step 5: helper to run one linear regression
import statsmodels.api as sm
def run_regression(y_series, x_series):
clean = pd.DataFrame({"y": y_series, "x": x_series}).dropna()
if len(clean) < 3:
return None
X = sm.add_constant(clean["x"])
model = sm.OLS(clean["y"], X).fit()
return {
"intercept": float(model.params.get("const", np.nan)),
"slope_on_gap_proxy_dm1_to_d0": float(model.params.get("x", np.nan)),
"r_squared": float(model.rsquared),
"p_value_for_slope": float(model.pvalues.get("x", np.nan)),
"std_error_for_slope": float(model.bse.get("x", np.nan)),
"rows_used": int(model.nobs),
}
In [30]:
# Step 6: loop windows, merge, regress
results = []
skipped = []
for sheet in event_window_sheets:
event_table = pd.read_excel("event_study.xlsx", sheet_name=sheet)
# must have the join keys and a CAR column
miss_event = [c for c in join_keys if c not in event_table.columns]
if miss_event:
skipped.append({"window_sheet": sheet, "reason": f"Missing join keys: {miss_event}"})
continue
if "CAR" not in event_table.columns:
skipped.append({"window_sheet": sheet, "reason": "No CAR column"})
continue
# merge
merged = pd.merge(
event_table[join_keys + ["CAR"]],
features_table[join_keys + [predictor_col]],
on=join_keys,
how="inner"
)
if merged.empty:
skipped.append({"window_sheet": sheet, "reason": "Merge produced zero rows"})
continue
# regress CAR on gap
out = run_regression(merged["CAR"], merged[predictor_col])
if out is None:
skipped.append({"window_sheet": sheet, "reason": "Too few rows after dropping missing values"})
continue
out["window_sheet"] = sheet
out["car_column"] = "CAR"
results.append(out)
results_table = pd.DataFrame(results).sort_values("window_sheet")
skipped_table = pd.DataFrame(skipped)
print("Results:")
display(results_table)
print("\nSkipped:")
display(skipped_table)
Results:
| | intercept | slope_on_gap_proxy_dm1_to_d0 | r_squared | p_value_for_slope | std_error_for_slope | rows_used | window_sheet | car_column |
|---|---|---|---|---|---|---|---|---|
| 3 | -0.004615 | 0.946683 | 0.702821 | 2.890488e-35 | 0.054625 | 129 | CAR_(-1,+1) | CAR |
| 4 | -0.006355 | 0.978147 | 0.593682 | 1.327799e-26 | 0.071806 | 129 | CAR_(-1,+5)_ROBUST | CAR |
| 0 | -0.006765 | 1.001988 | 0.754168 | 1.640008e-40 | 0.050763 | 129 | CAR_(0,1) | CAR |
| 1 | -0.007769 | 1.062884 | 0.676748 | 6.140477e-33 | 0.065184 | 129 | CAR_(0,3) | CAR |
| 2 | -0.008506 | 1.033452 | 0.634210 | 1.627027e-29 | 0.069645 | 129 | CAR_(0,5) | CAR |
Skipped:
In [35]:
# Linear regressions of CAR on gap_proxy_dm1_to_d0
# Run a separate regression for each ticker inside each CAR window.
import pandas as pd
import numpy as np
from pathlib import Path
import statsmodels.api as sm
# ----- Files -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
# ----- Discover sheets -----
event_book = pd.ExcelFile(event_file)
feature_book = pd.ExcelFile(features_file)
# Pick every event window sheet that starts with "CAR"
event_window_sheets = [s for s in event_book.sheet_names if str(s).upper().startswith("CAR")]
# The features are in the "features" sheet
features_sheet = "features"
# ----- Join keys and predictor -----
join_keys = ["ticker", "announce_date", "timing", "day0"]
predictor_col = "gap_proxy_dm1_to_d0"
# ----- Load features once -----
features_table = pd.read_excel(features_file, sheet_name=features_sheet)
# Basic checks
missing_in_features = [c for c in join_keys if c not in features_table.columns]
if missing_in_features:
raise ValueError(f"Join keys missing in features sheet: {missing_in_features}")
if predictor_col not in features_table.columns:
raise ValueError(f"Missing predictor column in features sheet: {predictor_col}")
# ----- Helper: run one regression -----
def run_regression(y, x):
frame = pd.DataFrame({"y": y, "x": x}).dropna()
if len(frame) < 3:
return None
X = sm.add_constant(frame["x"])
model = sm.OLS(frame["y"], X).fit()
return {
"intercept": float(model.params.get("const", np.nan)),
"slope_on_gap_proxy_dm1_to_d0": float(model.params.get("x", np.nan)),
"r_squared": float(model.rsquared),
"p_value_for_slope": float(model.pvalues.get("x", np.nan)),
"std_error_for_slope": float(model.bse.get("x", np.nan)),
"rows_used": int(model.nobs),
}
# ----- Loop windows, then tickers -----
all_rows = []
skipped = []
for window_sheet in event_window_sheets:
event_table = pd.read_excel(event_file, sheet_name=window_sheet)
# Must have the join keys and a CAR column
missing_in_event = [c for c in join_keys if c not in event_table.columns]
if missing_in_event:
skipped.append({"window_sheet": window_sheet, "ticker": None,
"reason": f"Missing join keys in event sheet: {missing_in_event}"})
continue
if "CAR" not in event_table.columns:
skipped.append({"window_sheet": window_sheet, "ticker": None,
"reason": "No CAR column in event sheet"})
continue
# Merge event rows with features on the keys
merged = pd.merge(
event_table[join_keys + ["CAR"]],
features_table[join_keys + [predictor_col]],
on=join_keys,
how="inner"
)
if merged.empty:
skipped.append({"window_sheet": window_sheet, "ticker": None,
"reason": "Merge produced zero rows"})
continue
# Group by ticker inside this window
for ticker, grp in merged.groupby("ticker", dropna=False):
out = run_regression(grp["CAR"], grp[predictor_col])
if out is None:
skipped.append({"window_sheet": window_sheet, "ticker": ticker,
"reason": "Too few rows after removing missing values"})
continue
out["window_sheet"] = window_sheet
out["ticker"] = ticker
out["car_column"] = "CAR"
all_rows.append(out)
# ----- Build tables (robust to empty lists) -----
results_by_ticker = pd.DataFrame(all_rows)
if not results_by_ticker.empty:
results_by_ticker = results_by_ticker.sort_values(
["window_sheet", "ticker"]
).reset_index(drop=True)
else:
# create an empty frame with the expected columns
results_by_ticker = pd.DataFrame(columns=[
"window_sheet","ticker","car_column",
"intercept","slope_on_gap_proxy_dm1_to_d0",
"r_squared","p_value_for_slope","std_error_for_slope","rows_used"
])
skipped_table = pd.DataFrame(skipped)
if skipped_table.empty:
# nothing was skipped
skipped_table = pd.DataFrame(columns=["window_sheet","ticker","reason"])
else:
# make sure the sort keys exist even if some dicts missed them
for col in ["window_sheet", "ticker"]:
if col not in skipped_table.columns:
skipped_table[col] = pd.NA
skipped_table = skipped_table.sort_values(
["window_sheet", "ticker"], na_position="last"
).reset_index(drop=True)
print("Results (first rows):")
display(results_by_ticker.head(20))
print("\nSkipped (first rows):")
display(skipped_table.head(20))
Results (first rows):
| | intercept | slope_on_gap_proxy_dm1_to_d0 | r_squared | p_value_for_slope | std_error_for_slope | rows_used | window_sheet | ticker | car_column |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001128 | 0.850779 | 0.608825 | 6.886153e-10 | 0.106503 | 43 | CAR_(-1,+1) | AAPL | CAR |
| 1 | -0.004444 | 0.780576 | 0.728623 | 3.517210e-13 | 0.074397 | 43 | CAR_(-1,+1) | GOOGL | CAR |
| 2 | -0.011812 | 1.084953 | 0.737988 | 1.701914e-13 | 0.100961 | 43 | CAR_(-1,+1) | NVDA | CAR |
| 3 | 0.004423 | 0.969133 | 0.507463 | 8.430830e-08 | 0.149111 | 43 | CAR_(-1,+5)_ROBUST | AAPL | CAR |
| 4 | -0.011178 | 0.824559 | 0.703350 | 2.219041e-12 | 0.083631 | 43 | CAR_(-1,+5)_ROBUST | GOOGL | CAR |
| 5 | -0.013724 | 1.093544 | 0.609152 | 6.767499e-10 | 0.136800 | 43 | CAR_(-1,+5)_ROBUST | NVDA | CAR |
| 6 | -0.002004 | 0.934096 | 0.685250 | 7.564049e-12 | 0.098869 | 43 | CAR_(0,1) | AAPL | CAR |
| 7 | -0.008858 | 0.843446 | 0.838031 | 8.369310e-18 | 0.057910 | 43 | CAR_(0,1) | GOOGL | CAR |
| 8 | -0.010297 | 1.117168 | 0.758700 | 3.105621e-14 | 0.098394 | 43 | CAR_(0,1) | NVDA | CAR |
| 9 | -0.001606 | 1.019201 | 0.644760 | 9.300364e-11 | 0.118149 | 43 | CAR_(0,3) | AAPL | CAR |
| 10 | -0.013247 | 0.913102 | 0.784202 | 3.096975e-15 | 0.074806 | 43 | CAR_(0,3) | GOOGL | CAR |
| 11 | -0.009129 | 1.162476 | 0.659518 | 3.856087e-11 | 0.130444 | 43 | CAR_(0,3) | NVDA | CAR |
| 12 | 0.001291 | 1.052451 | 0.583153 | 2.585735e-09 | 0.138966 | 43 | CAR_(0,5) | AAPL | CAR |
| 13 | -0.015592 | 0.887429 | 0.761765 | 2.384949e-14 | 0.077506 | 43 | CAR_(0,5) | GOOGL | CAR |
| 14 | -0.012209 | 1.125759 | 0.626345 | 2.656516e-10 | 0.135795 | 43 | CAR_(0,5) | NVDA | CAR |
Skipped (first rows):
| window_sheet | ticker | reason |
|---|---|---|
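With per-ticker results in hand, a window-by-ticker pivot makes the slopes directly comparable at a glance. A sketch assuming a frame shaped like `results_by_ticker` above (the values here are illustrative placeholders, not the real estimates):

```python
import pandas as pd

results = pd.DataFrame({
    "window_sheet": ["CAR_(0,1)", "CAR_(0,1)", "CAR_(0,3)", "CAR_(0,3)"],
    "ticker": ["AAPL", "NVDA", "AAPL", "NVDA"],
    "slope_on_gap_proxy_dm1_to_d0": [0.93, 1.12, 1.02, 1.16],
})

# One row per window, one column per ticker
slopes = results.pivot(index="window_sheet", columns="ticker",
                       values="slope_on_gap_proxy_dm1_to_d0")
print(slopes)
```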
In [37]:
# window_sheet, ticker, slope_on_gap_proxy_dm1_to_d0, std_error_for_slope, p_value_for_slope
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
required_cols = {
"window_sheet",
"ticker",
"slope_on_gap_proxy_dm1_to_d0",
"std_error_for_slope",
"p_value_for_slope",
}
missing = required_cols - set(results_by_ticker.columns)
if missing:
raise ValueError(f"Missing columns in results_by_ticker: {missing}")
# Clean copy
res = results_by_ticker.copy()
# Make a folder for pictures
os.makedirs("figures", exist_ok=True)
In [39]:
import statsmodels.api as sm
def plot_scatter_for(window_name, ticker_name,
event_file="event_study.xlsx",
features_file="features_patched.xlsx",
features_sheet="features",
join_keys=("ticker","announce_date","timing","day0"),
predictor_col="gap_proxy_dm1_to_d0"):
# Load the two tables
event = pd.read_excel(event_file, sheet_name=window_name)
feat = pd.read_excel(features_file, sheet_name=features_sheet)
# Filter to the ticker
event = event[event["ticker"] == ticker_name]
feat = feat[feat["ticker"] == ticker_name]
# Keep only needed columns
event_use = event[list(join_keys) + ["CAR"]]
feat_use = feat[list(join_keys) + [predictor_col]]
merged = pd.merge(event_use, feat_use, on=list(join_keys), how="inner").dropna(subset=["CAR", predictor_col])
if merged.empty:
print("No merged rows for that pair.")
return
# Fit a line for the label
X = sm.add_constant(merged[predictor_col])
model = sm.OLS(merged["CAR"], X).fit()
slope = model.params.get(predictor_col, np.nan)
pval = model.pvalues.get(predictor_col, np.nan)
# Build the plot
plt.figure(figsize=(7, 5))
plt.scatter(merged[predictor_col], merged["CAR"])
# Draw the fitted line
x_line = np.linspace(merged[predictor_col].min(), merged[predictor_col].max(), 100)
y_line = model.params.get("const", 0.0) + slope * x_line
plt.plot(x_line, y_line)
plt.xlabel("gap_proxy_dm1_to_d0")
plt.ylabel("CAR")
plt.title(f"{ticker_name} — {window_name}\nSlope: {slope:.4g} | p value: {pval:.3g}")
plt.tight_layout()
out = f"figures/scatter_{window_name}_{ticker_name}.png".replace(" ", "_")
plt.savefig(out, dpi=150)
plt.show()
print(f"Rows used: {int(model.nobs)} Saved: {out}")
In [41]:
plot_scatter_for("CAR_(0,1)", "AAPL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(0,1)_AAPL.png
In [43]:
plot_scatter_for("CAR_(0,1)", "GOOGL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(0,1)_GOOGL.png
In [45]:
plot_scatter_for("CAR_(0,1)", "NVDA") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(0,1)_NVDA.png
In [49]:
plot_scatter_for("CAR_(-1,+1)", "AAPL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+1)_AAPL.png
In [51]:
plot_scatter_for("CAR_(-1,+1)", "GOOGL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+1)_GOOGL.png
In [53]:
plot_scatter_for("CAR_(-1,+1)", "NVDA") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+1)_NVDA.png
In [57]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "AAPL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+5)_ROBUST_AAPL.png
In [59]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "GOOGL") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+5)_ROBUST_GOOGL.png
In [61]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "NVDA") # change the ticker to one you hold
Rows used: 43 Saved: figures/scatter_CAR_(-1,+5)_ROBUST_NVDA.png
In [64]:
# Multiple linear regression for AAPL, window CAR_(0,1)
# Drivers: gap_proxy_dm1_to_d0, vix_chg_5d_lag1, pre_vol_5d
import pandas as pd
import numpy as np
from pathlib import Path
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# ----- Settings -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
ticker_filter = "AAPL"
predictors = ["gap_proxy_dm1_to_d0", "vix_chg_5d_lag1", "pre_vol_5d"]
# ----- Load -----
ev = pd.read_excel(event_file, sheet_name=event_sheet)
ft = pd.read_excel(features_file, sheet_name=features_sheet)
# Basic checks
need_ev = set(join_keys + [target_col])
need_ft = set(join_keys + predictors)
missing_ev = [c for c in need_ev if c not in ev.columns]
missing_ft = [c for c in need_ft if c not in ft.columns]
if missing_ev:
raise ValueError(f"Event sheet is missing: {missing_ev}")
if missing_ft:
raise ValueError(f"Features sheet is missing: {missing_ft}")
# ----- Merge and filter to ticker -----
merged = (
pd.merge(
ev[join_keys + [target_col]],
ft[join_keys + predictors],
on=join_keys,
how="inner"
)
.query("ticker == @ticker_filter")
.dropna(subset=[target_col] + predictors)
.copy()
)
n_rows = len(merged)
print(f"Rows for {ticker_filter} in {event_sheet}: {n_rows}")
if n_rows < 10:
print("Warning: very few rows. Treat the result as weak.")
# ----- Build design matrices -----
X = merged[predictors]
X = sm.add_constant(X) # add intercept
y = merged[target_col]
# ----- Fit model (ordinary) -----
ols = sm.OLS(y, X).fit()
# ----- Fit model (heteroskedasticity-robust, HC3) -----
ols_hc3 = sm.OLS(y, X).fit(cov_type="HC3")
# ----- Tidy tables -----
def tidy_result(res):
coefs = res.params
ses = res.bse
tvals = res.tvalues
pvals = res.pvalues
out = pd.DataFrame({
"term": coefs.index,
"coefficient": coefs.values,
"standard_error": ses.values,
"t_value": tvals.values,
"p_value": pvals.values
})
return out
tidy_ordinary = tidy_result(ols)
tidy_robust = tidy_result(ols_hc3)
# ----- R squared and sample info -----
metrics = pd.DataFrame([{
"r_squared": float(ols.rsquared),
"r_squared_adjusted": float(ols.rsquared_adj),
"r_squared_robust_same_fit": float(ols_hc3.rsquared), # same fit, different errors
"observations": int(ols.nobs)
}])
# ----- Multicollinearity check (VIF) -----
# Drop the constant for VIF calculation and rebuild array with constant first column
X_no_const = merged[predictors]
X_vif = np.column_stack([np.ones(len(X_no_const))] + [X_no_const[c].values for c in predictors])
vif_rows = []
for i, name in enumerate(["const"] + predictors):
try:
vif_val = variance_inflation_factor(X_vif, i)
except Exception:
vif_val = np.nan
vif_rows.append({"term": name, "vif": float(vif_val)})
vif_table = pd.DataFrame(vif_rows)
# ----- Show outputs -----
print("\nCoefficients (ordinary errors):")
display(tidy_ordinary)
print("\nCoefficients (robust errors, HC3):")
display(tidy_robust)
print("\nModel metrics:")
display(metrics)
print("\nVariance inflation factors:")
display(vif_table)
# ----- Save to a file -----
out_path = Path(f"{ticker_filter}_CAR_0_1_MLR.xlsx")  # filename follows the ticker setting
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
pd.DataFrame([{
"event_sheet": event_sheet,
"ticker": ticker_filter,
"predictors": ", ".join(predictors)
}]).to_excel(writer, sheet_name="meta", index=False)
tidy_ordinary.to_excel(writer, sheet_name="coefficients_ordinary", index=False)
tidy_robust.to_excel(writer, sheet_name="coefficients_robust", index=False)
metrics.to_excel(writer, sheet_name="metrics", index=False)
vif_table.to_excel(writer, sheet_name="vif", index=False)
print(f"\nSaved: {out_path.resolve()}")
Rows for AAPL in CAR_(0,1): 43

Coefficients (ordinary errors):
| | term | coefficient | standard_error | t_value | p_value |
|---|---|---|---|---|---|
| 0 | const | -0.004600 | 0.008614 | -0.534067 | 5.963268e-01 |
| 1 | gap_proxy_dm1_to_d0 | 0.928265 | 0.096965 | 9.573204 | 8.671299e-12 |
| 2 | vix_chg_5d_lag1 | -0.043587 | 0.023180 | -1.880392 | 6.753806e-02 |
| 3 | pre_vol_5d | 0.285117 | 0.540734 | 0.527278 | 6.009872e-01 |
Coefficients (robust errors, HC3):
| | term | coefficient | standard_error | t_value | p_value |
|---|---|---|---|---|---|
| 0 | const | -0.004600 | 0.007579 | -0.607009 | 5.438447e-01 |
| 1 | gap_proxy_dm1_to_d0 | 0.928265 | 0.109294 | 8.493307 | 2.008396e-17 |
| 2 | vix_chg_5d_lag1 | -0.043587 | 0.024225 | -1.799252 | 7.197884e-02 |
| 3 | pre_vol_5d | 0.285117 | 0.515844 | 0.552720 | 5.804554e-01 |
Model metrics:
| | r_squared | r_squared_adjusted | r_squared_robust_same_fit | observations |
|---|---|---|---|---|
| 0 | 0.712389 | 0.690265 | 0.712389 | 43 |
Variance inflation factors:
| | term | vif |
|---|---|---|
| 0 | const | 4.797689 |
| 1 | gap_proxy_dm1_to_d0 | 1.001275 |
| 2 | vix_chg_5d_lag1 | 1.007522 |
| 3 | pre_vol_5d | 1.006412 |
Saved: C:\Users\dcazo\Documents\AAPL_CAR_0_1_MLR.xlsx
In [66]:
# Visualise AAPL in window CAR_(0,1)
# Three scatter plots with best-fit lines (one per predictor)
# One observed vs predicted plot from the multiple linear regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as statsmodels_api
# ---- Settings you can change ----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
ticker_to_show = "AAPL"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_column = "CAR"
predictor_columns = ["gap_proxy_dm1_to_d0", "vix_chg_5d_lag1", "pre_vol_5d"]
# ---- Load and join ----
event_table = pd.read_excel(event_file, sheet_name=event_sheet)
features_table = pd.read_excel(features_file, sheet_name=features_sheet)
merged_data = pd.merge(
event_table[join_keys + [target_column]],
features_table[join_keys + predictor_columns],
on=join_keys,
how="inner",
)
merged_data = merged_data.loc[merged_data["ticker"] == ticker_to_show]
merged_data = merged_data.dropna(subset=[target_column] + predictor_columns).copy()
if merged_data.empty:
raise ValueError("No rows found after merge and filter. Check the ticker, window, or column names.")
# ---- Helper: simple scatter with best-fit line y = a + b x ----
def scatter_with_fit(data_frame, x_name, y_name, title_text):
x_values = data_frame[x_name].to_numpy()
y_values = data_frame[y_name].to_numpy()
X_design = statsmodels_api.add_constant(x_values)
model = statsmodels_api.OLS(y_values, X_design).fit()
intercept = float(model.params[0])
slope = float(model.params[1])
r_squared = float(model.rsquared)
x_line = np.linspace(x_values.min(), x_values.max(), 100)
y_line = intercept + slope * x_line
plt.figure(figsize=(7, 5))
plt.scatter(x_values, y_values)
plt.plot(x_line, y_line)
plt.xlabel(x_name)
plt.ylabel(y_name)
plt.title(f"{title_text}\nSlope: {slope:.4g} Intercept: {intercept:.4g} R squared: {r_squared:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
# ---- Draw one plot per predictor ----
for predictor in predictor_columns:
scatter_with_fit(
merged_data,
predictor,
target_column,
title_text=f"{ticker_to_show} — {event_sheet}",
)
# ---- Observed vs predicted from the full multiple regression ----
X_full = statsmodels_api.add_constant(merged_data[predictor_columns])
y_full = merged_data[target_column].to_numpy()
mlr_model = statsmodels_api.OLS(y_full, X_full).fit()
y_pred = mlr_model.fittedvalues.to_numpy()
r2_full = float(mlr_model.rsquared)
# Best-fit line between predicted and observed (not forced to 45 degrees)
X_line = statsmodels_api.add_constant(y_pred)
line_model = statsmodels_api.OLS(y_full, X_line).fit()
line_intercept, line_slope = line_model.params
x_line = np.linspace(y_pred.min(), y_pred.max(), 100)
y_line = line_intercept + line_slope * x_line
plt.figure(figsize=(7, 5))
plt.scatter(y_pred, y_full)
plt.plot(x_line, y_line)
plt.plot(x_line, x_line, linestyle="--") # 45 degree reference
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker_to_show} — {event_sheet}\nObserved vs Predicted R squared: {r2_full:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
In [68]:
# Multiple linear regression for GOOGL, window CAR_(0,1)
# Drivers: gap_proxy_dm1_to_d0, pre_vol_5d, eps_surprise_pct
# This prints clean tables and draws plots inside the notebook. No files are written.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
# ----- Settings -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
ticker_filter = "GOOGL"
predictors = ["gap_proxy_dm1_to_d0", "pre_vol_5d", "eps_surprise_pct"]
# ----- Load -----
events = pd.read_excel(event_file, sheet_name=event_sheet)
features = pd.read_excel(features_file, sheet_name=features_sheet)
# sanity checks
need_events = set(join_keys + [target_col])
need_features = set(join_keys + predictors)
miss_events = [c for c in need_events if c not in events.columns]
miss_features = [c for c in need_features if c not in features.columns]
if miss_events:
raise ValueError(f"Event sheet is missing: {miss_events}")
if miss_features:
raise ValueError(f"Features sheet is missing: {miss_features}")
# ----- Merge and filter -----
merged = (
pd.merge(events[join_keys + [target_col]],
features[join_keys + predictors],
on=join_keys, how="inner")
.query("ticker == @ticker_filter")
.dropna(subset=[target_col] + predictors)
.copy()
)
print(f"Rows for {ticker_filter} in {event_sheet}: {len(merged)}")
if len(merged) < 10:
print("Warning: very few rows. Treat the result as weak.")
# ----- Design matrices -----
X = sm.add_constant(merged[predictors]) # adds the intercept
y = merged[target_col]
# ----- Fit models -----
ols = sm.OLS(y, X).fit()
ols_hc3 = sm.OLS(y, X).fit(cov_type="HC3")
# ----- Tidy coefficient tables -----
def tidy(res):
return pd.DataFrame({
"term": res.params.index,
"coefficient": res.params.values,
"standard_error": res.bse.values,
"t_value": res.tvalues.values,
"p_value": res.pvalues.values
})
coefs_ordinary = tidy(ols)
coefs_robust = tidy(ols_hc3)
# ----- Model metrics -----
metrics = pd.DataFrame([{
"r_squared": float(ols.rsquared),
"r_squared_adjusted": float(ols.rsquared_adj),
"observations": int(ols.nobs)
}])
# ----- Variance inflation factors -----
X_for_vif = np.column_stack([np.ones(len(merged))] + [merged[c].to_numpy() for c in predictors])
vif_rows = []
for i, name in enumerate(["const"] + predictors):
try:
vif_val = variance_inflation_factor(X_for_vif, i)
except Exception:
vif_val = np.nan
vif_rows.append({"term": name, "variance_inflation_factor": float(vif_val)})
vif_table = pd.DataFrame(vif_rows)
# ----- Show tables -----
print("\nCoefficients (ordinary errors):")
display(coefs_ordinary)
print("\nCoefficients (robust errors, HC3):")
display(coefs_robust)
print("\nModel metrics:")
display(metrics)
print("\nVariance inflation factors:")
display(vif_table)
# ----- Plots inside the notebook -----
# 1) Observed versus predicted
y_hat = ols.fittedvalues.to_numpy()
plt.figure(figsize=(7,5))
plt.scatter(y_hat, y)
# best fit line between predicted and observed
X_line = sm.add_constant(y_hat)
line_model = sm.OLS(y, X_line).fit()
a2, b2 = line_model.params
xx = np.linspace(y_hat.min(), y_hat.max(), 100)
yy = a2 + b2 * xx
plt.plot(xx, yy)
# 45-degree reference
plt.plot(xx, xx, linestyle="--")
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker_filter} — {event_sheet}\nObserved versus Predicted R squared: {ols.rsquared:.3f}")
plt.tight_layout()
plt.show()
# 2) Residuals versus fitted
resid = ols.resid.to_numpy()
plt.figure(figsize=(7,5))
plt.scatter(y_hat, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Predicted CAR")
plt.ylabel("Residual")
plt.title(f"{ticker_filter} — {event_sheet}\nResiduals versus Predicted")
plt.tight_layout()
plt.show()
# 3) Quantile–quantile plot of residuals
sm.qqplot(resid, line="45")
plt.title(f"{ticker_filter} — {event_sheet}\nResiduals quantile–quantile")
plt.tight_layout()
plt.show()
Rows for GOOGL in CAR_(0,1): 43

Coefficients (ordinary errors):
| | term | coefficient | standard_error | t_value | p_value |
|---|---|---|---|---|---|
| 0 | const | -0.013904 | 0.007469 | -1.861613 | 7.020632e-02 |
| 1 | gap_proxy_dm1_to_d0 | 0.839998 | 0.060735 | 13.830488 | 1.268989e-16 |
| 2 | pre_vol_5d | 0.315221 | 0.434983 | 0.724674 | 4.729774e-01 |
| 3 | eps_surprise_pct | 0.000550 | 0.017200 | 0.031958 | 9.746682e-01 |
Coefficients (robust errors, HC3):
| | term | coefficient | standard_error | t_value | p_value |
|---|---|---|---|---|---|
| 0 | const | -0.013904 | 0.007046 | -1.973195 | 4.847331e-02 |
| 1 | gap_proxy_dm1_to_d0 | 0.839998 | 0.075371 | 11.144780 | 7.593171e-29 |
| 2 | pre_vol_5d | 0.315221 | 0.460504 | 0.684513 | 4.936511e-01 |
| 3 | eps_surprise_pct | 0.000550 | 0.013306 | 0.041310 | 9.670484e-01 |
Model metrics:
| | r_squared | r_squared_adjusted | observations |
|---|---|---|---|
| 0 | 0.840422 | 0.828146 | 43 |
Variance inflation factors:
| | term | variance_inflation_factor |
|---|---|---|
| 0 | const | 5.130352 |
| 1 | gap_proxy_dm1_to_d0 | 1.061984 |
| 2 | pre_vol_5d | 1.088619 |
| 3 | eps_surprise_pct | 1.150988 |
In [70]:
# Scatter plots with best fit lines for GOOGL in window CAR_(0,1)
# One plot per driver and one full model plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# ---- Settings you can change ----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
ticker = "GOOGL"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
predictors = ["gap_proxy_dm1_to_d0", "pre_vol_5d", "eps_surprise_pct"]
# ---- Load and merge ----
events = pd.read_excel(event_file, sheet_name=event_sheet)
features = pd.read_excel(features_file, sheet_name=features_sheet)
data = (
pd.merge(
events[join_keys + [target_col]],
features[join_keys + predictors],
on=join_keys,
how="inner",
)
.query("ticker == @ticker")
.dropna(subset=[target_col] + predictors)
.copy()
)
if data.empty:
raise ValueError("No rows after merge and filter. Check the ticker, window, or column names.")
# ---- Helper: scatter with best fit line y = a + b x ----
def scatter_with_fit(df, x_name, y_name, title_text):
x = df[x_name].to_numpy()
y = df[y_name].to_numpy()
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
a, b = model.params # intercept, slope
x_line = np.linspace(x.min(), x.max(), 100)
y_line = a + b * x_line
plt.figure(figsize=(7, 5))
plt.scatter(x, y)
plt.plot(x_line, y_line)
plt.xlabel(x_name)
plt.ylabel(y_name)
plt.title(f"{title_text}\nSlope: {b:.4g} Intercept: {a:.4g} R squared: {model.rsquared:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
# ---- One plot per driver ----
for xcol in predictors:
scatter_with_fit(data, xcol, target_col, f"{ticker} — {event_sheet}")
# ---- Full model: observed versus predicted ----
X_full = sm.add_constant(data[predictors])
y_full = data[target_col].to_numpy()
mlr = sm.OLS(y_full, X_full).fit()
y_hat = mlr.fittedvalues.to_numpy()
# Best fit line between predicted and observed (not forced to forty five degrees)
X_line = sm.add_constant(y_hat)
line_model = sm.OLS(y_full, X_line).fit()
a2, b2 = line_model.params
xx = np.linspace(y_hat.min(), y_hat.max(), 100)
yy = a2 + b2 * xx
plt.figure(figsize=(7, 5))
plt.scatter(y_hat, y_full)
plt.plot(xx, yy)
plt.plot(xx, xx, linestyle="--") # forty five degree reference
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker} — {event_sheet}\nObserved versus Predicted R squared: {mlr.rsquared:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
In [1]:
# === Setup ===
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
# === Paths ===
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), # notebook folder
Path("/mnt/data"), # uploaded files fallback
]
FEATURE_FILES = ["features v1.xlsx", "features v2.xlsx", "features v3.xlsx"]
EVENT_FILE = "event_study.xlsx"
# Optional manual overrides if names are odd
MANUAL_COLNAMES = {
# "features v1.xlsx": {"day0": "day0", "ticker": "ticker"},
# "features v2.xlsx": {"day0": "day0", "ticker": "ticker"},
# "features v3.xlsx": {"day0": "day0", "ticker": "ticker"},
# "event_study.xlsx": {"day0": "day0", "ticker": "ticker"},
}
# === Helpers ===
def find_file(filename):
for base in BASE_DIRS:
p = base / filename
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
"""Skip readme sheets. Pick the sheet with the most numeric columns, then most rows."""
candidates = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not candidates:
return max(book, key=lambda n: len(book[n]))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
name, _ = max(candidates, key=score)
return name
def find_event_window_sheets(book: dict):
"""Map each window to its sheet by name pattern."""
sheet_map = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if sheet_map[w] is None and pat.search(str(name)):
sheet_map[w] = name
return sheet_map
def find_day0_column(df: pd.DataFrame) -> str | None:
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict:
return strict[0]
fallbacks = [
"event_date","EventDate","EVENT_DATE","eventDate",
"announcement_date","AnnouncementDate","ANNOUNCEMENT_DATE","ann_date","AnnDate",
"date","Date","DATE","trading_date","TradingDate",
"day0date","date0","Date0","DATE0"
]
for name in fallbacks:
if name in df.columns:
return name
best, best_nonnull = None, -1
for c in df.columns:
s = pd.to_datetime(df[c], errors="coerce")
nonnull = int(s.notna().sum())
if nonnull > best_nonnull:
best, best_nonnull = c, nonnull
return best if best_nonnull > 0 else None
def find_ticker_column(df: pd.DataFrame) -> str | None:
tickers = [
"ticker","Ticker","symbol","Symbol","ric","RIC","permno","PERMNO",
"isin","ISIN","cusip","CUSIP","sedol","SEDOL"
]
for name in tickers:
if name in df.columns:
return name
# last resort: pick a non-numeric column with many unique short codes
obj_cols = df.select_dtypes(include=["object"]).columns
best, best_score = None, -1
for c in obj_cols:
s = df[c].astype(str).str.strip()
uniq = s.nunique()
avg_len = s.str.len().mean()
score = uniq - 0.1*avg_len
if uniq > 50 and score > best_score:
best, best_score = c, score
return best
def normalize_day0(s: pd.Series) -> pd.Series:
# pick the parse that yields more valid dates
d1 = pd.to_datetime(s, errors="coerce").dt.normalize()
d2 = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
use = d2.where(d2.notna(), d1)
return use
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
out = df.copy()
for c in out.columns:
if out[c].dtype == "object":
try:
out[c] = pd.to_numeric(out[c], errors="raise")
except Exception:
pass
return out
def find_target_column_event(df: pd.DataFrame) -> str | None:
cols = list(df.columns)
pri = [c for c in cols if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if pri:
return pri[0]
sec = [c for c in cols if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
if sec:
return sec[0]
# last resort: the only numeric column left besides keys
return None
def aggregate_features_by_keys(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str) -> tuple[pd.DataFrame, list]:
"""
One row per [day0, ticker].
Aggregate numeric predictors by mean.
Keep the key columns.
Returns the grouped frame plus the list of numeric predictor columns.
"""
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
grouped = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
return grouped, num_cols # return numeric predictors list for later filtering
def build_X_from_features_only(merged: pd.DataFrame, numeric_feature_cols: list, target_col: str) -> pd.DataFrame:
"""
Use numeric predictors that came from the features sheet.
Drop the target if it shares a name.
Drop zero variance columns.
"""
keep_cols = [c for c in numeric_feature_cols if c in merged.columns]
X = merged.loc[:, keep_cols].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunique = X.nunique(dropna=False)
X = X.loc[:, nunique > 1]
return X
def fit_and_score(X: pd.DataFrame, y: pd.Series, k_max=5, random_state=42):
data = pd.concat([y, X], axis=1).dropna()
y_clean = data.iloc[:, 0]
X_clean = data.iloc[:, 1:]
n_rows = len(y_clean)
n_feat = X_clean.shape[1]
if n_feat == 0 or n_rows < max(10, n_feat + 2):
return {"rows_used": int(n_rows), "features_used": int(n_feat),
"r_squared": np.nan, "adjusted_r_squared": np.nan, "cross_validated_r_squared": np.nan}
model = LinearRegression()
model.fit(X_clean.values, y_clean.values)
r2 = float(model.score(X_clean.values, y_clean.values))
n = float(n_rows)
p = float(n_feat)
adj = 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
splits = min(k_max, n_rows)
if splits < 3:
cv_r2 = np.nan
else:
kf = KFold(n_splits=splits, shuffle=True, random_state=random_state)
cv_scores = cross_val_score(LinearRegression(), X_clean.values, y_clean.values, cv=kf, scoring="r2")
cv_r2 = float(np.nanmean(cv_scores))
return {"rows_used": int(n_rows), "features_used": int(n_feat),
"r_squared": r2, "adjusted_r_squared": adj, "cross_validated_r_squared": cv_r2}
# === Load event study and map windows ===
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError(f"Could not find {EVENT_FILE}")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
window_to_sheet = find_event_window_sheets(evt_book)
if not any(window_to_sheet.values()):
raise ValueError("Could not detect event study sheets for 0,1 0,3 0,5.")
# === Main loop with join on [day0, ticker] ===
merge_log = []
rows = []
for feat_name in FEATURE_FILES:
fpath = find_file(feat_name)
if fpath is None:
print(f"Warning: {feat_name} not found. Skipping.")
continue
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
feat_sheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[feat_sheet].copy()
# resolve column names
day0_feat = MANUAL_COLNAMES.get(feat_name, {}).get("day0") or find_day0_column(df_feat_raw)
ticker_feat = MANUAL_COLNAMES.get(feat_name, {}).get("ticker") or find_ticker_column(df_feat_raw)
if day0_feat is None or ticker_feat is None:
print(f"\n{feat_name}: could not find day0 or ticker. Found day0={day0_feat}, ticker={ticker_feat}.")
continue
# aggregate to one row per key
feat_agg, numeric_cols = aggregate_features_by_keys(df_feat_raw, day0_feat, ticker_feat)
for window_key in ["0,1", "0,3", "0,5"]:
evt_sheet = window_to_sheet.get(window_key)
if evt_sheet is None:
print(f"{feat_name} | window {window_key}: no event sheet.")
continue
df_evt = evt_book[evt_sheet].copy()
# resolve event columns
day0_evt = MANUAL_COLNAMES.get(EVENT_FILE, {}).get("day0") or find_day0_column(df_evt)
ticker_evt = MANUAL_COLNAMES.get(EVENT_FILE, {}).get("ticker") or find_ticker_column(df_evt)
y_col = find_target_column_event(df_evt)
if day0_evt is None or ticker_evt is None or y_col is None:
print(f"{feat_name} | window {window_key}: cannot resolve columns. day0={day0_evt}, ticker={ticker_evt}, target={y_col}")
continue
# normalise and dedupe event by keys
evt_clean = df_evt.copy()
evt_clean["__day0__"] = normalize_day0(evt_clean[day0_evt])
evt_clean["__ticker__"] = normalize_ticker(evt_clean[ticker_evt])
evt_targets = evt_clean[["__day0__","__ticker__", y_col]].dropna(subset=["__day0__","__ticker__", y_col])
# drop duplicates on keys, keep first target for that key
dup_evt = int(evt_targets.duplicated(subset=["__day0__","__ticker__"]).sum())
evt_targets = evt_targets.drop_duplicates(subset=["__day0__","__ticker__"], keep="first")
# join on both keys
merged = feat_agg.merge(evt_targets, on=["__day0__","__ticker__"], how="inner")
merged_rows = len(merged)
missing_predictors_rows = int(pd.concat([merged[[y_col]], merged[numeric_cols]], axis=1).isna().any(axis=1).sum())
# build predictors
X = build_X_from_features_only(merged, numeric_cols, target_col=y_col)
if X.shape[1] == 0 or merged_rows == 0:
print(f"{feat_name} | window {window_key}: zero predictors or zero rows after merge.")
continue
y = merged[y_col]
metrics = fit_and_score(X, y)
merge_log.append({
"features_file": feat_name,
"features_sheet": feat_sheet,
"event_sheet": evt_sheet,
"window": window_key,
"day0_features_col": day0_feat,
"ticker_features_col": ticker_feat,
"day0_event_col": day0_evt,
"ticker_event_col": ticker_evt,
"duplicates_in_event_for_keys": dup_evt,
"rows_in_features_after_groupby": len(feat_agg),
"rows_after_merge": merged_rows,
"rows_dropped_due_to_missing_predictors_or_target": missing_predictors_rows,
"predictors_used": metrics["features_used"],
"target_col": y_col,
})
rows.append({
"features_file": feat_name,
"features_sheet": feat_sheet,
"event_sheet": evt_sheet,
"window": window_key,
"rows_used": metrics["rows_used"],
"features_used": metrics["features_used"],
"r_squared": metrics["r_squared"],
"adjusted_r_squared": metrics["adjusted_r_squared"],
"cross_validated_r_squared": metrics["cross_validated_r_squared"],
})
# === Show results ===
from IPython.display import display
pd.set_option("display.max_columns", None)
log_df = pd.DataFrame(merge_log)
res_df = pd.DataFrame(rows)
if not log_df.empty:
print("\nMerge audit (joined on day0 + ticker, features deduplicated by mean within keys):")
display(log_df)
if res_df.empty:
print("\nNo models were fit. Set MANUAL_COLNAMES at the top if the column names are unusual.")
else:
order = {"0,1": 0, "0,3": 1, "0,5": 2}
res_df["window_order"] = res_df["window"].map(order).fillna(99)
res_df = res_df.sort_values(["window_order", "features_file"]).drop(columns=["window_order"])
print("\nDetailed results (one row per features set and window):")
display(res_df.reset_index(drop=True))
print("\nComparison table (rows are windows, columns are metrics per features set):")
wide = res_df.pivot_table(index=["window"],
columns="features_file",
values=["r_squared", "adjusted_r_squared", "cross_validated_r_squared"],
aggfunc="first")
display(wide)
print("\nBest by adjusted coefficient of determination within each window:")
for w in ["0,1", "0,3", "0,5"]:
block = res_df[res_df["window"] == w]
if not block.empty:
top = block.sort_values("adjusted_r_squared", ascending=False).iloc[0]
print(f" Window {w}: {top['features_file']} adjusted={top['adjusted_r_squared']:.4f} cross_validated={top['cross_validated_r_squared']:.4f}")
Merge audit (joined on day0 + ticker, features deduplicated by mean within keys):
| | features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | duplicates_in_event_for_keys | rows_in_features_after_groupby | rows_after_merge | rows_dropped_due_to_missing_predictors_or_target | predictors_used | target_col |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 16 | CAR |
| 1 | features v1.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 16 | CAR |
| 2 | features v1.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 16 | CAR |
| 3 | features v2.xlsx | data | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 24 | CAR |
| 4 | features v2.xlsx | data | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 24 | CAR |
| 5 | features v2.xlsx | data | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 24 | CAR |
| 6 | features v3.xlsx | data | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 41 | CAR |
| 7 | features v3.xlsx | data | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 41 | CAR |
| 8 | features v3.xlsx | data | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 0 | 129 | 129 | 0 | 41 | CAR |
Detailed results (one row per features set and window):
| | features_file | features_sheet | event_sheet | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | features | CAR_(0,1) | 0,1 | 129 | 16 | 0.303485 | 0.203983 | -0.053040 |
| 1 | features v2.xlsx | data | CAR_(0,1) | 0,1 | 129 | 24 | 0.335126 | 0.181693 | -0.135035 |
| 2 | features v3.xlsx | data | CAR_(0,1) | 0,1 | 129 | 41 | 0.452112 | 0.193912 | -0.347043 |
| 3 | features v1.xlsx | features | CAR_(0,3) | 0,3 | 129 | 16 | 0.250824 | 0.143799 | -0.147430 |
| 4 | features v2.xlsx | data | CAR_(0,3) | 0,3 | 129 | 24 | 0.272953 | 0.105174 | -0.263206 |
| 5 | features v3.xlsx | data | CAR_(0,3) | 0,3 | 129 | 41 | 0.403288 | 0.122078 | -0.516824 |
| 6 | features v1.xlsx | features | CAR_(0,5) | 0,5 | 129 | 16 | 0.257400 | 0.151314 | -0.108714 |
| 7 | features v2.xlsx | data | CAR_(0,5) | 0,5 | 129 | 24 | 0.273454 | 0.105789 | -0.202303 |
| 8 | features v3.xlsx | data | CAR_(0,5) | 0,5 | 129 | 41 | 0.414750 | 0.138942 | -0.429681 |
Comparison table (rows are windows, columns are metrics per features set):
(v1, v2, v3 abbreviate features v1.xlsx, features v2.xlsx, features v3.xlsx)
| window | adjusted_r_squared (v1) | adjusted_r_squared (v2) | adjusted_r_squared (v3) | cross_validated_r_squared (v1) | cross_validated_r_squared (v2) | cross_validated_r_squared (v3) | r_squared (v1) | r_squared (v2) | r_squared (v3) |
|---|---|---|---|---|---|---|---|---|---|
| 0,1 | 0.203983 | 0.181693 | 0.193912 | -0.053040 | -0.135035 | -0.347043 | 0.303485 | 0.335126 | 0.452112 |
| 0,3 | 0.143799 | 0.105174 | 0.122078 | -0.147430 | -0.263206 | -0.516824 | 0.250824 | 0.272953 | 0.403288 |
| 0,5 | 0.151314 | 0.105789 | 0.138942 | -0.108714 | -0.202303 | -0.429681 | 0.257400 | 0.273454 | 0.414750 |
Best by adjusted coefficient of determination within each window:
 Window 0,1: features v1.xlsx adjusted=0.2040 cross_validated=-0.0530
 Window 0,3: features v1.xlsx adjusted=0.1438 cross_validated=-0.1474
 Window 0,5: features v1.xlsx adjusted=0.1513 cross_validated=-0.1087
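The consistently negative cross-validated values above signal overfitting rather than a computation error: out-of-sample R² drops below zero whenever the model predicts held-out points worse than simply using their mean. A small sketch with pure-noise predictors (synthetic data, roughly matching this section's shape of 129 rows and a few dozen predictors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Target unrelated to the predictors, so any fit is pure noise-chasing
rng = np.random.default_rng(0)
X = rng.normal(size=(129, 24))
y = rng.normal(size=129)

in_sample = LinearRegression().fit(X, y).score(X, y)  # inflated by fitting noise
cv = cross_val_score(
    LinearRegression(), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="r2",
).mean()
print(f"in-sample R^2 = {in_sample:.3f}, cross-validated R^2 = {cv:.3f}")
```

Note the in-sample R² is sizeable even though the predictors carry no signal, while the cross-validated score goes negative — the same gap visible between the `r_squared` and `cross_validated_r_squared` columns above.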
In [2]:
# --- MLR visualisations: join on [day0 + ticker], features-only predictors ---
# If needed first run:
# !pip install pandas numpy scikit-learn matplotlib openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
# ====== CONFIG ======
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURE_FILES = [
DATA_DIR / "features v1.xlsx",
DATA_DIR / "features v2.xlsx",
DATA_DIR / "features v3.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"] # CAR windows to use
BEST_SET_FOR_SCATTERS = "features v1.xlsx" # pick which features set to show in the scatter plots
SAVE_FIGS = False # set True if you want PNGs saved next to this notebook
# ====== HELPERS ======
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def choose_features_sheet(book: dict) -> str:
# pick the non-readme sheet with most numeric cols, then most rows
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_day0_column(df: pd.DataFrame) -> str | None:
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, re.I)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame) -> str | None:
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback: likely code column
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.I),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.I),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.I),
}
for name in book:
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(name)):
out[w] = name
return out
def find_target_column_event(df: pd.DataFrame) -> str | None:
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
# drop zero-variance
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def metrics_and_predictions(X: pd.DataFrame, y: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return {"rows": int(n), "p": int(p), "r2": np.nan, "adj": np.nan, "cv": np.nan,
"y": pd.Series(dtype=float), "yhat_cv": pd.Series(dtype=float)}
model = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(model.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
kf = KFold(n_splits=min(5, n), shuffle=True, random_state=42)
cv = float(np.mean(cross_val_score(LinearRegression(), X_c.values, y_c.values, cv=kf, scoring="r2")))
yhat_cv = pd.Series(cross_val_predict(LinearRegression(), X_c.values, y_c.values, cv=kf), index=y_c.index)
return {"rows": int(n), "p": int(p), "r2": r2, "adj": adj, "cv": cv, "y": y_c, "yhat_cv": yhat_cv}
# ====== LOAD ======
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
features_data = {}
for fpath in FEATURE_FILES:
book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(book)
raw = book[fsheet].copy()
dcol = find_day0_column(raw)
tcol = find_ticker_column(raw)
grouped, num_cols = aggregate_features(raw, dcol, tcol)
features_data[fpath.name] = {"grouped": grouped, "num_cols": num_cols, "sheet": fsheet, "day0": dcol, "ticker": tcol}
# ====== METRICS ======
rows = []
preds = {} # (features, window) -> (y, yhat_cv)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
raise ValueError(f"Could not find event sheet for window {w}.")
df_evt = evt_book[esheet].copy()
d0_evt = find_day0_column(df_evt)
tk_evt = find_ticker_column(df_evt)
ycol = find_target_column_event(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[d0_evt])
evt["__ticker__"] = normalize_ticker(evt[tk_evt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
for fname, pack in features_data.items():
merged = pack["grouped"].merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
X = build_X(merged, pack["num_cols"], ycol)
y = merged[ycol]
m = metrics_and_predictions(X, y)
rows.append({
"features_file": fname, "window": w,
"rows_used": m["rows"], "features_used": m["p"],
"r_squared": m["r2"], "adjusted_r_squared": m["adj"], "cross_validated_r_squared": m["cv"]
})
preds[(fname, w)] = (m["y"], m["yhat_cv"])
metrics = pd.DataFrame(rows).sort_values(["window","features_file"]).reset_index(drop=True)
display(metrics)
# Save metrics (optional)
out_csv = DATA_DIR / "mlr_metrics_by_features_and_window.csv"
metrics.to_csv(out_csv, index=False)
print(f"Saved metrics to: {out_csv}")
# ====== PLOTS ======
# 1) Bars: cross-validated R^2 by features set, for each window
for w in WINDOWS:
sub = metrics[metrics["window"] == w]
if sub.empty:
continue
plt.figure()
plt.bar(sub["features_file"], sub["cross_validated_r_squared"])
plt.title(f"Cross Validated R Squared by Features Set (Window {w})")
plt.xlabel("Features Set")
plt.ylabel("Cross Validated R Squared")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
if SAVE_FIGS:
plt.savefig(DATA_DIR / f"cv_r2_bar_window_{w.replace(',','_')}.png", dpi=160)
plt.show()
# 2) Lines: adjusted R^2, cross-validated R^2, and R^2 across windows
for metric in ["adjusted_r_squared", "cross_validated_r_squared", "r_squared"]:
plt.figure()
for fname in features_data.keys():
xs, ys = [], []
for w in WINDOWS:
row = metrics[(metrics["features_file"] == fname) & (metrics["window"] == w)]
if not row.empty:
xs.append(w)
ys.append(float(row.iloc[0][metric]))
if ys:
plt.plot(xs, ys, marker="o", label=fname)
plt.title(metric.replace("_"," ").title() + " Across Windows")
plt.xlabel("Window")
plt.ylabel(metric.replace("_"," ").title())
plt.legend()
plt.tight_layout()
if SAVE_FIGS:
plt.savefig(DATA_DIR / f"{metric}_across_windows.png", dpi=160)
plt.show()
# 3) Scatters: out-of-sample predictions vs actual, for the chosen features set, per window
for w in WINDOWS:
y, yhat = preds.get((BEST_SET_FOR_SCATTERS, w), (pd.Series(dtype=float), pd.Series(dtype=float)))
if y.empty:
continue
plt.figure()
plt.scatter(y, yhat, alpha=0.7)
# 45-degree line
mn = float(min(y.min(), yhat.min()))
mx = float(max(y.max(), yhat.max()))
plt.plot([mn, mx], [mn, mx])
plt.title(f"Out-of-sample Predictions vs Actual (Window {w}) — {BEST_SET_FOR_SCATTERS}")
plt.xlabel("Actual CAR")
plt.ylabel("Predicted CAR (Cross Validated)")
plt.tight_layout()
if SAVE_FIGS:
plt.savefig(DATA_DIR / f"scatter_cv_{BEST_SET_FOR_SCATTERS.replace(' ','_')}_{w.replace(',','_')}.png", dpi=160)
plt.show()
# 4) Bars: adjusted R^2 vs number of predictors (per window)
for w in WINDOWS:
sub = metrics[metrics["window"] == w].copy()
if sub.empty:
continue
plt.figure()
plt.bar(sub["features_used"].astype(int).astype(str), sub["adjusted_r_squared"])
plt.title(f"Adjusted R Squared by Number of Predictors (Window {w})")
plt.xlabel("Number of Predictors")
plt.ylabel("Adjusted R Squared")
plt.tight_layout()
if SAVE_FIGS:
plt.savefig(DATA_DIR / f"adj_r2_vs_nfeatures_{w.replace(',','_')}.png", dpi=160)
plt.show()
| | features_file | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | 0,1 | 129 | 16 | 0.303485 | 0.203983 | -0.053040 |
| 1 | features v2.xlsx | 0,1 | 129 | 24 | 0.335126 | 0.181693 | -0.135035 |
| 2 | features v3.xlsx | 0,1 | 129 | 41 | 0.452112 | 0.193912 | -0.347043 |
| 3 | features v1.xlsx | 0,3 | 129 | 16 | 0.250824 | 0.143799 | -0.147430 |
| 4 | features v2.xlsx | 0,3 | 129 | 24 | 0.272953 | 0.105174 | -0.263206 |
| 5 | features v3.xlsx | 0,3 | 129 | 41 | 0.403288 | 0.122078 | -0.516824 |
| 6 | features v1.xlsx | 0,5 | 129 | 16 | 0.257400 | 0.151314 | -0.108714 |
| 7 | features v2.xlsx | 0,5 | 129 | 24 | 0.273454 | 0.105789 | -0.202303 |
| 8 | features v3.xlsx | 0,5 | 129 | 41 | 0.414750 | 0.138942 | -0.429681 |
Saved metrics to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data\mlr_metrics_by_features_and_window.csv
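The pruning cell below relies on ticker-aware folds: `GroupKFold` guarantees that all rows for a given ticker land on either the training or the test side of a split, never both, so scores reflect generalisation to unseen tickers. A minimal sketch with hypothetical ticker labels standing in for the real grouping column:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical ticker labels standing in for the real grouping column
groups = np.array(["AAPL"] * 4 + ["MSFT"] * 4 + ["GOOGL"] * 4)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.zeros(12)

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    train_tickers = set(groups[train_idx])
    test_tickers = set(groups[test_idx])
    # No ticker ever appears on both sides of a split
    assert train_tickers.isdisjoint(test_tickers)
    print(sorted(test_tickers))
```

An ordinary shuffled `KFold` would scatter one ticker's earnings events across train and test, letting ticker-specific quirks leak into the score.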
In [3]:
# === Feature pruning with grouped cross validation (ticker-aware) ===
# Join on [day0 + ticker]. Use features-only predictors. No columns from event study besides target.
# Outputs:
# - Baseline cross validated coefficient of determination for each features set and window
# - Leave-one-feature-out deltas (how much each feature helps or hurts)
# - Suggested drop list (features that hurt)
# - New score after dropping suggested features
# - Lasso stability selection frequency (how often a feature survives lasso across folds)
#
# If needed first run:
# !pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# ====== CONFIG ======
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURE_FILES = [
DATA_DIR / "features v1.xlsx",
DATA_DIR / "features v2.xlsx",
DATA_DIR / "features v3.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"]
NEGATIVE_DELTA_THRESHOLD = 0.005 # drop a feature if removing it improves cross validated coefficient of determination by at least this much
MAX_FOLDS = 5 # up to five folds for grouped cross validation
RANDOM_STATE = 42
# ====== HELPERS ======
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_day0_column(df: pd.DataFrame) -> str | None:
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","EVENT_DATE","eventDate",
"announcement_date","AnnouncementDate","ANNOUNCEMENT_DATE","ann_date","AnnDate",
"date","Date","DATE","trading_date","TradingDate",
"day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull:
best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame) -> str | None:
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns:
return c
# fallback: likely code column
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book:
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(name)):
out[w] = name
return out
def find_target_column_event(df: pd.DataFrame) -> str | None:
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
# drop zero-variance
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def grouped_cv_r2(model, X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(groups.nunique())
n_splits = min(max_folds, n_groups) # GroupKFold requires n_splits <= number of groups
if n_splits < 2:
return np.nan
gkf = GroupKFold(n_splits=n_splits)
scores = []
for tr, te in gkf.split(X, y, groups=groups):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
# out-of-sample coefficient of determination on the held-out fold
y_pred = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
scores.append(r2_test)
return float(np.nanmean(scores))
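The point of grouping by ticker is that no ticker contributes rows to both the train and test side of a fold, so the score reflects generalisation to unseen tickers. A small synthetic check of that guarantee:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Twelve rows across four hypothetical tickers, three events each
X = pd.DataFrame({"x": np.arange(12, dtype=float)})
y = pd.Series(2.0 * X["x"])
groups = pd.Series(["AAA", "BBB", "CCC", "DDD"] * 3)

leaks = 0
for tr, te in GroupKFold(n_splits=4).split(X, y, groups=groups):
    # any ticker present on both sides of a split would be leakage
    leaks += len(set(groups.iloc[tr]) & set(groups.iloc[te]))
```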
def leave_one_feature_out_deltas(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
base = grouped_cv_r2(LinearRegression(), X, y, groups, max_folds=max_folds)
rows = []
for col in X.columns:
X_drop = X.drop(columns=[col])
r2_drop = grouped_cv_r2(LinearRegression(), X_drop, y, groups, max_folds=max_folds)
delta = base - r2_drop # positive = feature helps; negative = feature hurts
rows.append({"feature": col, "base_cross_validated_r_squared": base,
"cross_validated_r_squared_without_feature": r2_drop,
"delta": delta})
out = pd.DataFrame(rows).sort_values("delta", ascending=True).reset_index(drop=True)
return base, out
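The sign convention for the deltas is worth spelling out with toy numbers (scores hypothetical): delta is the base score minus the score without the feature, so a negative delta means the model scores better once the feature is removed.

```python
# Toy scores (hypothetical): delta = base - score_without_feature
base = 0.12
score_without = {"noise_feature": 0.18, "signal_feature": 0.02}
deltas = {f: base - s for f, s in score_without.items()}

flag_to_drop = [f for f, d in deltas.items() if d < 0]  # removal improved the score
```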
def lasso_stability_selection(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5, alphas=None):
if alphas is None:
alphas = np.logspace(-4, 1, 12)
n_groups = int(groups.nunique())
n_splits = min(max_folds, n_groups) # GroupKFold requires 2 <= n_splits <= number of groups
gkf = GroupKFold(n_splits=n_splits)
counts = pd.Series(0, index=X.columns, dtype=int)
for tr, te in gkf.split(X, y, groups=groups):
Xtr, ytr = X.iloc[tr], y.iloc[tr]
gtr = groups.iloc[tr]
# inner split on training only (not group-aware inside to keep it light)
best_score, best_alpha = -1e9, None
for a in alphas:
pipe = Pipeline([("scaler", StandardScaler(with_mean=True, with_std=True)),
("lasso", Lasso(alpha=a, max_iter=10000, random_state=RANDOM_STATE))])
# simple inner score with ordinary k-fold on training only
kf = KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
vals = []
for tr2, te2 in kf.split(Xtr, ytr):
pipe.fit(Xtr.iloc[tr2].values, ytr.iloc[tr2].values)
ypred = pipe.predict(Xtr.iloc[te2].values)
ytrue = ytr.iloc[te2].values
ss_res = np.sum((ytrue - ypred)**2); ss_tot = np.sum((ytrue - np.mean(ytrue))**2)
r2 = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
vals.append(r2)
mean_score = float(np.nanmean(vals))
if mean_score > best_score:
best_score, best_alpha = mean_score, a
# fit with best alpha on full training fold and count non-zero features
pipe = Pipeline([("scaler", StandardScaler(with_mean=True, with_std=True)),
("lasso", Lasso(alpha=best_alpha, max_iter=10000, random_state=RANDOM_STATE))])
pipe.fit(Xtr.values, ytr.values)
coefs = pipe.named_steps["lasso"].coef_
support = (np.abs(coefs) > 1e-12)
counts.loc[X.columns[support]] += 1
freq = (counts / n_splits).rename("lasso_selection_frequency").to_frame()
return freq.sort_values("lasso_selection_frequency", ascending=False)
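Why the Lasso works as a selector here: with standardized predictors, coefficients whose signal falls below the penalty are driven exactly to zero, so counting non-zero supports across folds gives a selection frequency. A deterministic sketch on orthogonal synthetic columns, where only the first column drives the target:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Three mutually orthogonal sinusoid columns; only the first drives y
t = np.arange(64) / 64.0
X = np.column_stack([np.sin(2*np.pi*t), np.sin(4*np.pi*t), np.sin(6*np.pi*t)])
y = 3.0 * X[:, 0]

pipe = Pipeline([("scaler", StandardScaler()),
                 ("lasso", Lasso(alpha=0.5, max_iter=10000))])
pipe.fit(X, y)
support = np.abs(pipe.named_steps["lasso"].coef_) > 1e-12
```

Only the informative column survives the penalty; the two orthogonal distractors get exact zero coefficients.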
# ====== LOAD DATASETS ======
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
def build_dataset(features_path: Path, window_key: str):
# features
feat_book = pd.read_excel(features_path, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_grouped, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
# event
esheet = win_map.get(window_key)
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_column_event(df_evt)
evt_targets = df_evt.copy()
evt_targets["__day0__"] = normalize_day0(evt_targets[devt])
evt_targets["__ticker__"] = normalize_ticker(evt_targets[tevt])
evt_targets = evt_targets.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_grouped.merge(evt_targets[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"] # group by ticker for cross validation
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
return X, y, groups, fsheet, ycol, dfeat, tfeat, devt, tevt, len(merged)
# ====== MAIN LOOP ======
all_summaries = []
all_lofo = []
all_stability = []
all_drop_runs = []
for features_path in FEATURE_FILES:
for w in WINDOWS:
X, y, groups, fsheet, ycol, dfeat, tfeat, devt, tevt, n_merged = build_dataset(features_path, w)
if X.empty or len(y) < 10:
print(f"Skip {features_path.name} | window {w}: not enough data.")
continue
# Baseline grouped cross-validated coefficient of determination; the
# leave-one-feature-out helper computes and returns the same baseline score
base_cv, lofo = leave_one_feature_out_deltas(X, y, groups, max_folds=MAX_FOLDS)
base_cv = float(base_cv)
lofo["features_file"] = features_path.name
lofo["window"] = w
all_lofo.append(lofo)
# Suggested drops: features with negative delta below threshold (removing improves score)
drop_list = lofo[lofo["delta"] <= -NEGATIVE_DELTA_THRESHOLD]["feature"].tolist()
# New score after dropping suggested features
if drop_list:
X_pruned = X.drop(columns=drop_list)
new_cv = grouped_cv_r2(LinearRegression(), X_pruned, y, groups, max_folds=MAX_FOLDS)
else:
new_cv = base_cv
all_summaries.append({
"features_file": features_path.name,
"window": w,
"rows_used": len(y),
"features_used": X.shape[1],
"baseline_cross_validated_r_squared": base_cv,
"n_features_flagged_to_drop": len(drop_list),
"new_cross_validated_r_squared_after_drop": new_cv
})
# Lasso stability selection (light)
stability = lasso_stability_selection(X, y, groups, max_folds=MAX_FOLDS)
stability["features_file"] = features_path.name
stability["window"] = w
all_stability.append(stability.reset_index().rename(columns={"index":"feature"}))
# Store the actual drop list for reporting
if drop_list:
all_drop_runs.append(pd.DataFrame({
"features_file": [features_path.name]*len(drop_list),
"window": [w]*len(drop_list),
"feature_dropped": drop_list
}))
# ====== OUTPUT TABLES ======
summary_df = pd.DataFrame(all_summaries).sort_values(["window","features_file"]).reset_index(drop=True)
print("\n=== Summary per features set and window (grouped by ticker) ===")
display(summary_df)
# Save
summary_df.to_csv(DATA_DIR / "feature_pruning_summary.csv", index=False)
lofo_df = pd.concat(all_lofo, ignore_index=True) if all_lofo else pd.DataFrame()
if not lofo_df.empty:
# Order with most harmful first (most negative delta)
lofo_df = lofo_df.sort_values(["window","features_file","delta"])
print("\n=== Leave-one-feature-out deltas (negative = harmful) ===")
display(lofo_df)
lofo_df.to_csv(DATA_DIR / "leave_one_feature_out_deltas.csv", index=False)
stab_df = pd.concat(all_stability, ignore_index=True) if all_stability else pd.DataFrame()
if not stab_df.empty:
print("\n=== Lasso stability selection frequency (0 to 1) ===")
display(stab_df.sort_values(["window","features_file","lasso_selection_frequency"], ascending=[True, True, False]).reset_index(drop=True))
stab_df.to_csv(DATA_DIR / "lasso_stability_selection.csv", index=False)
if all_drop_runs:
drops_df = pd.concat(all_drop_runs, ignore_index=True)
print("\n=== Features flagged for drop by window and set ===")
display(drops_df)
drops_df.to_csv(DATA_DIR / "features_flagged_for_drop.csv", index=False)
print("\nFiles saved to:", DATA_DIR)
print(" - feature_pruning_summary.csv")
print(" - leave_one_feature_out_deltas.csv")
print(" - lasso_stability_selection.csv")
print(" - features_flagged_for_drop.csv")
=== Summary per features set and window (grouped by ticker) ===
| features_file | window | rows_used | features_used | baseline_cross_validated_r_squared | n_features_flagged_to_drop | new_cross_validated_r_squared_after_drop | |
|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | 0,1 | 129 | 16 | -0.115372 | 10 | 0.118568 |
| 1 | features v2.xlsx | 0,1 | 129 | 24 | -0.178825 | 11 | 0.032140 |
| 2 | features v3.xlsx | 0,1 | 129 | 41 | -1.321893 | 25 | -0.047858 |
| 3 | features v1.xlsx | 0,3 | 129 | 16 | -0.155072 | 10 | 0.095185 |
| 4 | features v2.xlsx | 0,3 | 129 | 24 | -0.241672 | 11 | -0.036041 |
| 5 | features v3.xlsx | 0,3 | 129 | 41 | -1.708581 | 26 | -0.059018 |
| 6 | features v1.xlsx | 0,5 | 129 | 16 | -0.089552 | 10 | 0.125117 |
| 7 | features v2.xlsx | 0,5 | 129 | 24 | -0.193283 | 10 | 0.015720 |
| 8 | features v3.xlsx | 0,5 | 129 | 41 | -1.425380 | 27 | -0.010384 |
=== Leave-one-feature-out deltas (negative = harmful) ===
| feature | base_cross_validated_r_squared | cross_validated_r_squared_without_feature | delta | features_file | window | |
|---|---|---|---|---|---|---|
| 0 | pre_ret_10d | -0.115372 | -0.040835 | -7.453721e-02 | features v1.xlsx | 0,1 |
| 1 | pre_vol_3d | -0.115372 | -0.054075 | -6.129709e-02 | features v1.xlsx | 0,1 |
| 2 | mkt_ret_1d_lag1 | -0.115372 | -0.072298 | -4.307393e-02 | features v1.xlsx | 0,1 |
| 3 | pre_vol_5d | -0.115372 | -0.075562 | -3.980961e-02 | features v1.xlsx | 0,1 |
| 4 | pre_ret_5d | -0.115372 | -0.079015 | -3.635720e-02 | features v1.xlsx | 0,1 |
| ... | ... | ... | ... | ... | ... | ... |
| 238 | high_yield_option_adjusted_spread_pct | -1.425380 | -1.425380 | 4.072298e-13 | features v3.xlsx | 0,5 |
| 239 | macro_cpi_yoy | -1.425380 | -1.440643 | 1.526227e-02 | features v3.xlsx | 0,5 |
| 240 | vix_level_lag1 | -1.425380 | -1.459459 | 3.407871e-02 | features v3.xlsx | 0,5 |
| 241 | pre_ret_3d | -1.425380 | -1.515569 | 9.018888e-02 | features v3.xlsx | 0,5 |
| 242 | eps_surprise_pct | -1.425380 | -1.851641 | 4.262606e-01 | features v3.xlsx | 0,5 |
243 rows × 6 columns
=== Lasso stability selection frequency (0 to 1) ===
| feature | lasso_selection_frequency | features_file | window | |
|---|---|---|---|---|
| 0 | eps_surprise_pct | 1.000000 | features v1.xlsx | 0,1 |
| 1 | pre_ret_3d | 1.000000 | features v1.xlsx | 0,1 |
| 2 | vix_chg_5d_lag1 | 1.000000 | features v1.xlsx | 0,1 |
| 3 | macro_us10y | 1.000000 | features v1.xlsx | 0,1 |
| 4 | pre_vol_3d | 0.666667 | features v1.xlsx | 0,1 |
| ... | ... | ... | ... | ... |
| 238 | quarter | 0.000000 | features v3.xlsx | 0,5 |
| 239 | vix_x_surprise | 0.000000 | features v3.xlsx | 0,5 |
| 240 | rates_x_surprise | 0.000000 | features v3.xlsx | 0,5 |
| 241 | high_rates_regime | 0.000000 | features v3.xlsx | 0,5 |
| 242 | high_vix_regime | 0.000000 | features v3.xlsx | 0,5 |
243 rows × 4 columns
=== Features flagged for drop by window and set ===
| features_file | window | feature_dropped | |
|---|---|---|---|
| 0 | features v1.xlsx | 0,1 | pre_ret_10d |
| 1 | features v1.xlsx | 0,1 | pre_vol_3d |
| 2 | features v1.xlsx | 0,1 | mkt_ret_1d_lag1 |
| 3 | features v1.xlsx | 0,1 | pre_vol_5d |
| 4 | features v1.xlsx | 0,1 | pre_ret_5d |
| ... | ... | ... | ... |
| 135 | features v3.xlsx | 0,5 | macro_us10y |
| 136 | features v3.xlsx | 0,5 | cpi_x_surprise |
| 137 | features v3.xlsx | 0,5 | high_vix_regime |
| 138 | features v3.xlsx | 0,5 | vix_chg_5d_lag1 |
| 139 | features v3.xlsx | 0,5 | is_january |
140 rows × 3 columns
Files saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - feature_pruning_summary.csv
 - leave_one_feature_out_deltas.csv
 - lasso_stability_selection.csv
 - features_flagged_for_drop.csv
In [1]:
# === Test pruned features v1.1 / v2.1 / v3.1 vs originals (join on day0 + ticker) ===
# If needed first: pip install pandas numpy scikit-learn openpyxl matplotlib
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold
# -------- config --------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data"),
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES_CANDIDATES = [
"features v1.xlsx","features v2.xlsx","features v3.xlsx",
"features v1.1.xlsx","features v2.1.xlsx","features v3.1.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"] # CAR windows to test
MAX_GROUP_FOLDS = 5
RANDOM_STATE = 42
# -------- helpers --------
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name): continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)): m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
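The strict day-0 pattern tolerates spaces and underscores around the zero but rejects names like `day0date` (no word boundary after the `0`), which is why those fall through to the explicit candidate list. A quick check on hypothetical column names:

```python
import re

pat = r"\bday[\s_]*0\b"
cols = ["Day 0", "day_0", "day0date", "holiday_01"]
strict = [c for c in cols if re.search(pat, c, flags=re.IGNORECASE)]
```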
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
# average test coefficient of determination across group folds
n_groups = int(groups.nunique())
n_splits = min(max_folds, n_groups) # GroupKFold requires 2 <= n_splits <= number of groups
gkf = GroupKFold(n_splits=n_splits)
model = LinearRegression()
scores = []
for tr, te in gkf.split(X, y, groups=groups):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
scores.append(r2_test)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
# in-sample coefficient of determination and adjusted coefficient of determination
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
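As a sanity check on the adjusted R-squared formula above, plugging in the v1 / window 0,1 numbers from the results table (n = 129 rows, p = 16 features) reproduces the reported value:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), numbers from the v1 run
n, p, r2 = 129, 16, 0.303485
adj = 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)
# adj is approximately 0.203983, matching the results table
```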
# -------- load event workbook --------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("event_study.xlsx not found in any base directory")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# -------- run all available features files --------
present = [f for f in FEATURE_FILES_CANDIDATES if find_file(f) is not None]
if not present:
raise FileNotFoundError("No features files found. Check paths.")
print("Testing these files:", present)
all_rows = []
merge_audit = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
# audit
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
# metrics
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, event_sheet=esheet, window=w))
all_rows.append(m)
# -------- results --------
audit_df = pd.DataFrame(merge_audit)
res_df = pd.DataFrame(all_rows)
pd.set_option("display.max_columns", None)
print("\nMerge audit (check keys and row counts):")
display(audit_df)
print("\nResults per features set and window (grouped by ticker):")
display(res_df.sort_values(["window","features_file"]).reset_index(drop=True))
# -------- optional: compare pruned vs original if both exist --------
def base_tag(name: str) -> str:
# "features v1.xlsx" -> "v1", "features v1.1.xlsx" -> "v1"
m = re.search(r"features\s+v(\d+)", name, flags=re.IGNORECASE)
return f"v{m.group(1)}" if m else name
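The tag and pruned-flag logic can be spot-checked directly on the file names used in this run:

```python
import re

def base_tag(name: str) -> str:
    # "features v1.xlsx" and "features v1.1.xlsx" both map to tag "v1"
    m = re.search(r"features\s+v(\d+)", name, flags=re.IGNORECASE)
    return f"v{m.group(1)}" if m else name

names = ["features v1.xlsx", "features v1.1.xlsx", "features v3.xlsx"]
tags = [base_tag(n) for n in names]
pruned = [bool(re.search(r"\.1\.xlsx$", n, flags=re.IGNORECASE)) for n in names]
```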
res_df["tag"] = res_df["features_file"].apply(base_tag)
res_df["is_pruned"] = res_df["features_file"].str.contains(r"\.1\.xlsx$", flags=re.IGNORECASE)
pairs = []
for w in WINDOWS:
for tag in sorted(res_df["tag"].unique()):
block = res_df[(res_df["window"] == w) & (res_df["tag"] == tag)]
if block["is_pruned"].nunique() < 2:
continue # need both original and pruned
base = block.loc[block["is_pruned"] == False].iloc[0]
prun = block.loc[block["is_pruned"] == True].iloc[0]
pairs.append({
"window": w, "set": tag,
"baseline_cross_validated_r_squared": base["cross_validated_r_squared"],
"pruned_cross_validated_r_squared": prun["cross_validated_r_squared"],
"delta_cross_validated_r_squared": prun["cross_validated_r_squared"] - base["cross_validated_r_squared"],
"baseline_adjusted_r_squared": base["adjusted_r_squared"],
"pruned_adjusted_r_squared": prun["adjusted_r_squared"],
"delta_adjusted_r_squared": prun["adjusted_r_squared"] - base["adjusted_r_squared"],
"baseline_r_squared": base["r_squared"],
"pruned_r_squared": prun["r_squared"],
"delta_r_squared": prun["r_squared"] - base["r_squared"],
"rows_used_baseline": base["rows_used"], "rows_used_pruned": prun["rows_used"],
"features_used_baseline": base["features_used"], "features_used_pruned": prun["features_used"],
})
if pairs:
comp = pd.DataFrame(pairs).sort_values(["window","set"]).reset_index(drop=True)
print("\nBefore vs after (original vs pruned) — deltas > 0 are good:")
display(comp)
# Save results
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "test_results_all_sets.csv", index=False)
if pairs:
comp.to_csv(out_dir / "test_results_pruned_vs_original.csv", index=False)
print(f"\nSaved CSVs to: {out_dir}")
print(" - test_results_all_sets.csv")
print(" - test_results_pruned_vs_original.csv")
Testing these files: ['features v1.xlsx', 'features v2.xlsx', 'features v3.xlsx', 'features v1.1.xlsx', 'features v2.1.xlsx', 'features v3.1.xlsx']
Merge audit (check keys and row counts):
| features_file | features_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | features | 0,1 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 1 | features v1.xlsx | features | 0,3 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 2 | features v1.xlsx | features | 0,5 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 3 | features v2.xlsx | data | 0,1 | day0 | ticker | day0 | ticker | 129 | 24 | CAR |
| 4 | features v2.xlsx | data | 0,3 | day0 | ticker | day0 | ticker | 129 | 24 | CAR |
| 5 | features v2.xlsx | data | 0,5 | day0 | ticker | day0 | ticker | 129 | 24 | CAR |
| 6 | features v3.xlsx | data | 0,1 | day0 | ticker | day0 | ticker | 129 | 41 | CAR |
| 7 | features v3.xlsx | data | 0,3 | day0 | ticker | day0 | ticker | 129 | 41 | CAR |
| 8 | features v3.xlsx | data | 0,5 | day0 | ticker | day0 | ticker | 129 | 41 | CAR |
| 9 | features v1.1.xlsx | features | 0,1 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 10 | features v1.1.xlsx | features | 0,3 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 11 | features v1.1.xlsx | features | 0,5 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 12 | features v2.1.xlsx | data | 0,1 | day0 | ticker | day0 | ticker | 129 | 20 | CAR |
| 13 | features v2.1.xlsx | data | 0,3 | day0 | ticker | day0 | ticker | 129 | 20 | CAR |
| 14 | features v2.1.xlsx | data | 0,5 | day0 | ticker | day0 | ticker | 129 | 20 | CAR |
| 15 | features v3.1.xlsx | data | 0,1 | day0 | ticker | day0 | ticker | 129 | 37 | CAR |
| 16 | features v3.1.xlsx | data | 0,3 | day0 | ticker | day0 | ticker | 129 | 37 | CAR |
| 17 | features v3.1.xlsx | data | 0,5 | day0 | ticker | day0 | ticker | 129 | 37 | CAR |
Results per features set and window (grouped by ticker):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | event_sheet | window | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 14 | 0.289191 | 0.201899 | -0.047634 | features v1.1.xlsx | features | CAR_(0,1) | 0,1 |
| 1 | 129 | 16 | 0.303485 | 0.203983 | -0.115372 | features v1.xlsx | features | CAR_(0,1) | 0,1 |
| 2 | 129 | 20 | 0.318352 | 0.192121 | -0.106293 | features v2.1.xlsx | data | CAR_(0,1) | 0,1 |
| 3 | 129 | 24 | 0.335126 | 0.181693 | -0.178825 | features v2.xlsx | data | CAR_(0,1) | 0,1 |
| 4 | 129 | 37 | 0.424671 | 0.190745 | -0.824364 | features v3.1.xlsx | data | CAR_(0,1) | 0,1 |
| 5 | 129 | 41 | 0.452112 | 0.193912 | -1.321893 | features v3.xlsx | data | CAR_(0,1) | 0,1 |
| 6 | 129 | 14 | 0.238836 | 0.145359 | -0.062822 | features v1.1.xlsx | features | CAR_(0,3) | 0,3 |
| 7 | 129 | 16 | 0.250824 | 0.143799 | -0.155072 | features v1.xlsx | features | CAR_(0,3) | 0,3 |
| 8 | 129 | 20 | 0.258931 | 0.121696 | -0.157287 | features v2.1.xlsx | data | CAR_(0,3) | 0,3 |
| 9 | 129 | 24 | 0.272953 | 0.105174 | -0.241672 | features v2.xlsx | data | CAR_(0,3) | 0,3 |
| 10 | 129 | 37 | 0.379130 | 0.126689 | -1.047901 | features v3.1.xlsx | data | CAR_(0,3) | 0,3 |
| 11 | 129 | 41 | 0.403288 | 0.122078 | -1.708581 | features v3.xlsx | data | CAR_(0,3) | 0,3 |
| 12 | 129 | 14 | 0.248291 | 0.155976 | -0.008339 | features v1.1.xlsx | features | CAR_(0,5) | 0,5 |
| 13 | 129 | 16 | 0.257400 | 0.151314 | -0.089552 | features v1.xlsx | features | CAR_(0,5) | 0,5 |
| 14 | 129 | 20 | 0.265767 | 0.129798 | -0.117638 | features v2.1.xlsx | data | CAR_(0,5) | 0,5 |
| 15 | 129 | 24 | 0.273454 | 0.105789 | -0.193283 | features v2.xlsx | data | CAR_(0,5) | 0,5 |
| 16 | 129 | 37 | 0.393061 | 0.146284 | -0.894411 | features v3.1.xlsx | data | CAR_(0,5) | 0,5 |
| 17 | 129 | 41 | 0.414750 | 0.138942 | -1.425380 | features v3.xlsx | data | CAR_(0,5) | 0,5 |
Before vs after (original vs pruned) — deltas > 0 are good:
| window | set | baseline_cross_validated_r_squared | pruned_cross_validated_r_squared | delta_cross_validated_r_squared | baseline_adjusted_r_squared | pruned_adjusted_r_squared | delta_adjusted_r_squared | baseline_r_squared | pruned_r_squared | delta_r_squared | rows_used_baseline | rows_used_pruned | features_used_baseline | features_used_pruned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | v1 | -0.115372 | -0.047634 | 0.067738 | 0.203983 | 0.201899 | -0.002085 | 0.303485 | 0.289191 | -0.014294 | 129 | 129 | 16 | 14 |
| 1 | 0,1 | v2 | -0.178825 | -0.106293 | 0.072531 | 0.181693 | 0.192121 | 0.010427 | 0.335126 | 0.318352 | -0.016774 | 129 | 129 | 24 | 20 |
| 2 | 0,1 | v3 | -1.321893 | -0.824364 | 0.497529 | 0.193912 | 0.190745 | -0.003166 | 0.452112 | 0.424671 | -0.027441 | 129 | 129 | 41 | 37 |
| 3 | 0,3 | v1 | -0.155072 | -0.062822 | 0.092250 | 0.143799 | 0.145359 | 0.001561 | 0.250824 | 0.238836 | -0.011988 | 129 | 129 | 16 | 14 |
| 4 | 0,3 | v2 | -0.241672 | -0.157287 | 0.084385 | 0.105174 | 0.121696 | 0.016522 | 0.272953 | 0.258931 | -0.014023 | 129 | 129 | 24 | 20 |
| 5 | 0,3 | v3 | -1.708581 | -1.047901 | 0.660679 | 0.122078 | 0.126689 | 0.004610 | 0.403288 | 0.379130 | -0.024157 | 129 | 129 | 41 | 37 |
| 6 | 0,5 | v1 | -0.089552 | -0.008339 | 0.081212 | 0.151314 | 0.155976 | 0.004662 | 0.257400 | 0.248291 | -0.009109 | 129 | 129 | 16 | 14 |
| 7 | 0,5 | v2 | -0.193283 | -0.117638 | 0.075645 | 0.105789 | 0.129798 | 0.024009 | 0.273454 | 0.265767 | -0.007686 | 129 | 129 | 24 | 20 |
| 8 | 0,5 | v3 | -1.425380 | -0.894411 | 0.530969 | 0.138942 | 0.146284 | 0.007342 | 0.414750 | 0.393061 | -0.021689 | 129 | 129 | 41 | 37 |
Saved CSVs to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - test_results_all_sets.csv
 - test_results_pruned_vs_original.csv
In [3]:
# === Compare features v1 vs v1.1 vs v1.2 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped (by ticker) Cross-Validated R^2
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.xlsx", "features v1.1.xlsx", "features v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)):
m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull:
best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback guess
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
# drop zero-variance predictors
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(groups.nunique())
n_splits = min(max_folds, n_groups) # GroupKFold requires 2 <= n_splits <= number of groups
gkf = GroupKFold(n_splits=n_splits)
model = LinearRegression()
scores = []
for tr, te in gkf.split(X, y, groups=groups):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
scores.append(r2_test)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("Could not find event_study.xlsx in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "None of the v1 files were found."
print("Testing files:", present)
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# Save CSVs next to your data
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results.csv")
print(" - v1_v1.1_v1.2_comparison_table.csv")
Testing files: ['features v1.xlsx', 'features v1.1.xlsx', 'features v1.2.xlsx']
Merge audit:
| features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 1 | features v1.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 2 | features v1.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 3 | features v1.1.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 4 | features v1.1.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 5 | features v1.1.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 14 | CAR |
| 6 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 7 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 8 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
Results (v1 vs v1.1 vs v1.2):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window | |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 14 | 0.289191 | 0.201899 | -0.047634 | features v1.1.xlsx | features | 0,1 |
| 1 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 2 | 129 | 16 | 0.303485 | 0.203983 | -0.115372 | features v1.xlsx | features | 0,1 |
| 3 | 129 | 14 | 0.238836 | 0.145359 | -0.062822 | features v1.1.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 5 | 129 | 16 | 0.250824 | 0.143799 | -0.155072 | features v1.xlsx | features | 0,3 |
| 6 | 129 | 14 | 0.248291 | 0.155976 | -0.008339 | features v1.1.xlsx | features | 0,5 |
| 7 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 8 | 129 | 16 | 0.257400 | 0.151314 | -0.089552 | features v1.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| adjusted_r_squared | cross_validated_r_squared | r_squared | |||||||
|---|---|---|---|---|---|---|---|---|---|
| features_file | features v1.1.xlsx | features v1.2.xlsx | features v1.xlsx | features v1.1.xlsx | features v1.2.xlsx | features v1.xlsx | features v1.1.xlsx | features v1.2.xlsx | features v1.xlsx |
| window | |||||||||
| 0,1 | 0.201899 | 0.194838 | 0.203983 | -0.047634 | 0.068034 | -0.115372 | 0.289191 | 0.245160 | 0.303485 |
| 0,3 | 0.145359 | 0.148246 | 0.143799 | -0.062822 | 0.094267 | -0.155072 | 0.238836 | 0.201481 | 0.250824 |
| 0,5 | 0.155976 | 0.162384 | 0.151314 | -0.008339 | 0.121771 | -0.089552 | 0.248291 | 0.214735 | 0.257400 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1_v1.1_v1.2_results.csv
 - v1_v1.1_v1.2_comparison_table.csv
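A negative cross-validated R² like several values in the comparison table is expected behavior, not a computation error: the out-of-fold score compares each test fold's predictions against that fold's own mean, so a model that generalizes worse than the fold mean scores below zero. A minimal sketch of the grouped scoring used in `grouped_cv_r2`, on synthetic data with made-up ticker labels:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({"surprise": rng.normal(size=n)})  # hypothetical predictor
y = pd.Series(0.2 * X["surprise"] + rng.normal(size=n))  # weak signal, heavy noise
groups = pd.Series(np.repeat([f"T{i}" for i in range(6)], 10))  # 6 fake tickers

scores = []
# Each fold is scored against its own mean, so a poor fit can go negative
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=groups):
    mdl = LinearRegression().fit(X.iloc[tr], y.iloc[tr])
    resid = y.iloc[te].to_numpy() - mdl.predict(X.iloc[te])
    ss_tot = np.sum((y.iloc[te].to_numpy() - y.iloc[te].mean()) ** 2)
    scores.append(1.0 - np.sum(resid ** 2) / ss_tot)
print(np.round(np.nanmean(scores), 3))
```

Because `GroupKFold` holds out whole tickers, every test fold contains only tickers the model never saw, which is what makes the score an honest cross-ticker estimate.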
In [11]:
# === Compare features v1.2 vs v1.3 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2 (by ticker)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)):
m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback guess
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
# drop zero-variance predictors
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(groups.nunique())
if n_groups < 2:
return np.nan  # GroupKFold needs at least 2 distinct groups
n_splits = min(max_folds, n_groups)  # never request more splits than groups
gkf = GroupKFold(n_splits=n_splits)
model = LinearRegression()
scores = []
for tr, te in gkf.split(X, y, groups=groups):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
scores.append(r2_test)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("Could not find event_study.xlsx in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Neither features v1.2.xlsx nor features v1.3.xlsx was found."
print("Testing files:", present)
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS (v1.3 - v1.2) ----------
if set(FEATURE_FILES).issubset(set(res_df["features_file"].unique())):
pairs = []
for w in WINDOWS:
a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
b = res_df[(res_df["features_file"]=="features v1.3.xlsx") & (res_df["window"]==w)]
if not a.empty and not b.empty:
pairs.append({
"window": w,
"delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
"rows_used_v1.2": int(a["rows_used"].iloc[0]),
"rows_used_v1.3": int(b["rows_used"].iloc[0]),
"features_used_v1.2": int(a["features_used"].iloc[0]),
"features_used_v1.3": int(b["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs)
print("\nDeltas (v1.3 minus v1.2) — positive is good:")
display(deltas)
# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.3_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.3_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.3_results.csv")
print(" - v1.2_vs_v1.3_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.3.xlsx']
Merge audit:
| features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.3.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 4 | features v1.3.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 5 | features v1.3.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
Results (v1.2 vs v1.3):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window | |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 16 | 0.259498 | 0.153712 | -0.337959 | features v1.3.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 16 | 0.216124 | 0.104142 | -0.111012 | features v1.3.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 16 | 0.226652 | 0.116174 | -0.037386 | features v1.3.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| adjusted_r_squared | cross_validated_r_squared | r_squared | ||||
|---|---|---|---|---|---|---|
| features_file | features v1.2.xlsx | features v1.3.xlsx | features v1.2.xlsx | features v1.3.xlsx | features v1.2.xlsx | features v1.3.xlsx |
| window | ||||||
| 0,1 | 0.194838 | 0.153712 | 0.068034 | -0.337959 | 0.245160 | 0.259498 |
| 0,3 | 0.148246 | 0.104142 | 0.094267 | -0.111012 | 0.201481 | 0.216124 |
| 0,5 | 0.162384 | 0.116174 | 0.121771 | -0.037386 | 0.214735 | 0.226652 |
Deltas (v1.3 minus v1.2) — positive is good:
| window | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_v1.2 | rows_used_v1.3 | features_used_v1.2 | features_used_v1.3 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | -0.405993 | -0.041126 | 0.014337 | 129 | 129 | 8 | 16 |
| 1 | 0,3 | -0.205279 | -0.044105 | 0.014643 | 129 | 129 | 8 | 16 |
| 2 | 0,5 | -0.159157 | -0.046209 | 0.011918 | 129 | 129 | 8 | 16 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.3_results.csv
 - v1.2_vs_v1.3_comparison_table.csv
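The pattern in the deltas (raw R² up, adjusted and cross-validated R² down for v1.3) is the classic overfitting signature: extra columns can only improve the in-sample fit, while the (n - 1)/(n - p - 1) penalty absorbs spurious gains. A sketch on synthetic data (assuming nothing about the actual feature files) showing pure-noise predictors inflating raw R² at the same row count:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 129  # same row count as the merged sample
x = rng.normal(size=(n, 1))
y = 0.3 * x[:, 0] + rng.normal(size=n)

def r2_and_adjusted(X, y):
    # same adjusted-R^2 formula as fit_and_score above
    r2 = LinearRegression().fit(X, y).score(X, y)
    p = X.shape[1]
    adj = 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)
    return r2, adj

r2_small, adj_small = r2_and_adjusted(x, y)
X_big = np.hstack([x, rng.normal(size=(n, 8))])  # add 8 irrelevant columns
r2_big, adj_big = r2_and_adjusted(X_big, y)
print(r2_big > r2_small)  # raw R^2 never decreases when columns are added
```

Adjusted R² typically stays flat or falls in this setup, and grouped cross-validation falls further, which is exactly the v1.2 vs v1.3 picture above.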
In [19]:
# === FIXED: Grow v1.2 using v3 candidates with safe group-aware CV ===
# - Adapts folds to the number of tickers in each split
# - Falls back to ordinary KFold when groups are too few
# - Same outputs as before (baseline, marginal gains, selected features, summary)
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE = "features v1.2.xlsx"
POOL_FILE = "features v3.xlsx"
WINDOWS = ["0,1","0,3","0,5"]
MAX_OUTER_FOLDS = 5
MAX_INNER_FOLDS = 3
MAX_FEATURES_TO_ADD = 5
MIN_GAIN = 0.01
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book:
if is_readme_sheet(name): continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)): m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def test_r2_on_fold(model, X_train, y_train, X_test, y_test):
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
ss_res = np.sum((y_test - y_hat)**2)
ss_tot = np.sum((y_test - np.mean(y_test))**2)
return 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
def safe_group_cv_scores(X, y, groups, max_splits=5, min_splits=2):
"""Return the mean test R^2 and the splitter used.
Uses GroupKFold when there are enough distinct groups; otherwise falls back to an ordinary shuffled KFold."""
n_groups = int(pd.Series(groups).nunique())
if n_groups >= min_splits:
n_splits = min(max_splits, n_groups)
splitter = GroupKFold(n_splits=n_splits)
model = LinearRegression()
scores = []
for tr, te in splitter.split(X, y, groups=groups):
scores.append(test_r2_on_fold(model, X.iloc[tr].values, y.iloc[tr].values,
X.iloc[te].values, y.iloc[te].values))
return float(np.nanmean(scores)), splitter
# fallback: ordinary KFold
n = len(X)
if n < 3:
return np.nan, None
n_splits = min(3, n)
splitter = KFold(n_splits=n_splits, shuffle=True, random_state=42)
model = LinearRegression()
scores = []
for tr, te in splitter.split(X, y):
scores.append(test_r2_on_fold(model, X.iloc[tr].values, y.iloc[tr].values,
X.iloc[te].values, y.iloc[te].values))
return float(np.nanmean(scores)), splitter
def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
if X.shape[1] == 0:
return np.nan, np.nan
mdl = LinearRegression().fit(X.values, y.values)
r2 = float(mdl.score(X.values, y.values))
n, p = len(y), X.shape[1]
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
return r2, adj
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
base_path = find_file(BASE_FILE)
pool_path = find_file(POOL_FILE)
base_book = pd.read_excel(base_path, sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw = base_book[base_sheet].copy()
base_day0 = find_day0_column(base_raw)
base_ticker = find_ticker_column(base_raw)
base_grouped, base_num_cols = aggregate_features(base_raw, base_day0, base_ticker)
pool_book = pd.read_excel(pool_path, sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw = pool_book[pool_sheet].copy()
pool_day0 = find_day0_column(pool_raw)
pool_ticker = find_ticker_column(pool_raw)
pool_grouped, pool_num_cols = aggregate_features(pool_raw, pool_day0, pool_ticker)
candidate_cols = [c for c in pool_num_cols if c not in base_num_cols]
# ---------- WORK ----------
all_quick = []
all_selected = []
all_summary = []
for window in WINDOWS:
esheet = win_map.get(window)
if esheet is None:
print(f"Skip window {window}: event sheet not found.")
continue
df_evt = evt_book[esheet].copy()
event_day0 = find_day0_column(df_evt)
event_ticker = find_ticker_column(df_evt)
y_col = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[event_day0])
evt["__ticker__"] = normalize_ticker(evt[event_ticker])
evt = evt.dropna(subset=["__day0__","__ticker__", y_col]).drop_duplicates(subset=["__day0__","__ticker__"])
merged_base = base_grouped.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
merged_pool = pool_grouped[["__day0__","__ticker__"] + candidate_cols]
merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")
X_base = build_X(merged, base_num_cols, y_col)
y = merged[y_col].astype(float)
groups = merged["__ticker__"]
# Baseline
base_cv, outer_splitter = safe_group_cv_scores(X_base, y, groups, max_splits=MAX_OUTER_FOLDS, min_splits=2)
base_r2, base_adj = in_sample_and_adjusted(X_base, y)
# Quick marginal gains (add one feature at a time)
quick_rows = []
for c in candidate_cols:
if c not in merged.columns:
continue
Xt = pd.concat([X_base, merged[[c]]], axis=1)
data = pd.concat([y, Xt], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
if X_c.shape[1] == 0 or len(y_c) < 10:
continue
cv_r2, _ = safe_group_cv_scores(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
quick_rows.append({"window": window, "feature": c, "cv_with_feature": cv_r2, "delta": cv_r2 - base_cv})
quick_df = pd.DataFrame(quick_rows).sort_values("delta", ascending=False).reset_index(drop=True)
all_quick.append(quick_df)
# Nested grouped forward selection (safe inner splitting)
# Build outer splitter (group-aware if possible)
outer_scores = []
fold_selected = []
# If we could not build a group-aware splitter (very rare), fall back to KFold
if isinstance(outer_splitter, GroupKFold):
splits = list(outer_splitter.split(X_base, y, groups=groups))
elif isinstance(outer_splitter, KFold):
splits = list(outer_splitter.split(X_base, y))
else:
# no splitter possible
splits = [(np.arange(len(X_base)), np.arange(0))]
for tr, te in splits:
Xb_tr, Xb_te = X_base.iloc[tr], X_base.iloc[te]
y_tr, y_te = y.iloc[tr], y.iloc[te]
groups_tr = groups.iloc[tr]
# inner helper with safe group CV on training fold
def inner_cv_score(Xt, yt):
return safe_group_cv_scores(Xt, yt, groups_tr.loc[Xt.index], max_splits=MAX_INNER_FOLDS, min_splits=2)[0]
# start point
data_tr = pd.concat([y_tr, Xb_tr], axis=1).dropna()
y_tr_c, X_tr_c = data_tr.iloc[:,0], data_tr.iloc[:,1:]
base_inner = inner_cv_score(X_tr_c, y_tr_c)
avail = [c for c in candidate_cols if c in merged.columns]
chosen = []
for _ in range(MAX_FEATURES_TO_ADD):
best_gain, best_feat = 0.0, None
for c in avail:
col = merged.loc[Xb_tr.index, c]
Xt = pd.concat([X_tr_c, col], axis=1).dropna()
yt = y_tr.loc[Xt.index]
if Xt.shape[1] == 0 or len(yt) < 10:
continue
score = inner_cv_score(Xt, yt)
gain = score - base_inner
if gain > best_gain:
best_gain, best_feat = gain, c
if best_feat is None or best_gain < MIN_GAIN:
break
# accept and update
chosen.append(best_feat)
avail.remove(best_feat)
X_tr_c = pd.concat([X_tr_c, merged.loc[Xb_tr.index, [best_feat]]], axis=1).dropna()
y_tr_c = y_tr.loc[X_tr_c.index]
base_inner = inner_cv_score(X_tr_c, y_tr_c)
fold_selected.append(chosen)
# evaluate on outer test fold
X_te = Xb_te.copy()
if chosen:
X_te = pd.concat([X_te, merged.loc[Xb_te.index, chosen]], axis=1)
data_te = pd.concat([y_te, X_te], axis=1).dropna()
y_te_c, X_te_c = data_te.iloc[:,0], data_te.iloc[:,1:]
if X_te_c.shape[1] == 0 or len(y_te_c) < 2:
outer_scores.append(np.nan)
else:
outer_scores.append(test_r2_on_fold(LinearRegression(),
X_tr_c.values, y_tr_c.values,
X_te_c.values, y_te_c.values))
# Frequencies across outer folds
flat = [f for sub in fold_selected for f in sub]
freq = pd.Series(flat).value_counts().rename("selected_in_folds").to_frame()
freq["window"] = window
freq = freq.reset_index().rename(columns={"index":"feature"})
all_selected.append(freq)
# Union of features picked in at least half the folds
keep_union = []
if not freq.empty and len(splits) > 0:
half = max(1, int(np.ceil(len(splits)/2)))
keep_union = freq.loc[freq["selected_in_folds"] >= half, "feature"].tolist()
X_full = X_base.copy()
if keep_union:
X_full = pd.concat([X_full, merged[keep_union]], axis=1)
data_full = pd.concat([y, X_full], axis=1).dropna()
y_full, X_full_c = data_full.iloc[:,0], data_full.iloc[:,1:]
full_cv, _ = safe_group_cv_scores(X_full_c, y_full, groups.loc[X_full_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
full_r2, full_adj = in_sample_and_adjusted(X_full_c, y_full)
all_summary.append({
"window": window,
"baseline_cross_validated_r_squared": base_cv,
"baseline_r_squared": base_r2,
"baseline_adjusted_r_squared": base_adj,
"nested_forward_mean_test_cross_validated_r_squared": float(np.nanmean(outer_scores)) if outer_scores else np.nan,
"selected_union_features": ", ".join(keep_union) if keep_union else "",
"union_model_cross_validated_r_squared": full_cv,
"union_model_r_squared": full_r2,
"union_model_adjusted_r_squared": full_adj,
"n_selected_union": len(keep_union)
})
# ---------- REPORT ----------
quick_all = pd.concat(all_quick, ignore_index=True) if all_quick else pd.DataFrame()
selected_all = pd.concat(all_selected, ignore_index=True) if all_selected else pd.DataFrame()
summary = pd.DataFrame(all_summary)
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)
print("\n=== Baseline vs improved (per window) ===")
display(summary)
if not quick_all.empty:
print("\n=== Quick marginal gains (top 25 per window) — delta vs v1.2 baseline cross validated coefficient of determination ===")
display(quick_all.sort_values(["window","delta"], ascending=[True, False]).groupby("window").head(25))
if not selected_all.empty:
print("\n=== Features selected by nested forward selection (frequency across outer folds) ===")
display(selected_all.sort_values(["window","selected_in_folds"], ascending=[True, False]))
# ---------- SAVE ----------
out_dir = find_file(EVENT_FILE).parent
summary.to_csv(out_dir / "v12_growth_summary.csv", index=False)
if not quick_all.empty:
quick_all.to_csv(out_dir / "v12_growth_quick_marginal_gains.csv", index=False)
if not selected_all.empty:
selected_all.to_csv(out_dir / "v12_growth_selected_frequencies.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_growth_summary.csv")
print(" - v12_growth_quick_marginal_gains.csv")
print(" - v12_growth_selected_frequencies.csv")
=== Baseline vs improved (per window) ===
| window | baseline_cross_validated_r_squared | baseline_r_squared | baseline_adjusted_r_squared | nested_forward_mean_test_cross_validated_r_squared | selected_union_features | union_model_cross_validated_r_squared | union_model_r_squared | union_model_adjusted_r_squared | n_selected_union | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 0.068034 | 0.245160 | 0.194838 | -0.121056 | baa_minus_aaa_bp, pre_vol_10d | 0.040045 | 0.272545 | 0.210896 | 2 |
| 1 | 0,3 | 0.094267 | 0.201481 | 0.148246 | -0.121196 | pre_vol_10d | 0.093876 | 0.212336 | 0.152765 | 1 |
| 2 | 0,5 | 0.121771 | 0.214735 | 0.162384 | -0.034418 | pre_vol_10d | 0.120651 | 0.221060 | 0.162149 | 1 |
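The small selected-feature unions in the summary come from the minimum-gain rule: a candidate is kept only if it lifts cross-validated R² by at least MIN_GAIN. A sketch of that greedy step on synthetic data (the feature names `c0` to `c2` are made up), using scikit-learn's `cross_val_score` in place of the notebook's group-aware scorer:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
n = 129
X_base = pd.DataFrame({"base": rng.normal(size=n)})
cands = pd.DataFrame(rng.normal(size=(n, 3)), columns=["c0", "c1", "c2"])
y = 0.5 * X_base["base"] + 0.8 * cands["c1"] + rng.normal(scale=0.5, size=n)

MIN_GAIN = 0.01
cv = KFold(n_splits=5, shuffle=True, random_state=0)
def cv_r2(X):
    return cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()

X_cur, avail, chosen, score = X_base.copy(), list(cands.columns), [], cv_r2(X_base)
while avail:
    # try each remaining candidate; keep the best only if it clears MIN_GAIN
    trials = {c: cv_r2(pd.concat([X_cur, cands[[c]]], axis=1)) for c in avail}
    best = max(trials, key=trials.get)
    if trials[best] - score < MIN_GAIN:
        break
    chosen.append(best)
    avail.remove(best)
    X_cur = pd.concat([X_cur, cands[[best]]], axis=1)
    score = trials[best]
print(chosen)  # the informative candidate c1 is picked; noise columns rarely clear the bar
```

This is why the nested selection usually stops after one or two additions: once the genuinely informative column is in, the remaining candidates behave like the noise columns here and fail the MIN_GAIN test.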
=== Quick marginal gains (top 25 per window) — delta vs v1.2 baseline cross validated coefficient of determination ===
| window | feature | cv_with_feature | delta | |
|---|---|---|---|---|
| 0 | 0,1 | macro_cpi_yoy | 0.081215 | 1.318111e-02 |
| 1 | 0,1 | pre_vol_10d | 0.072635 | 4.601765e-03 |
| 2 | 0,1 | rates_x_surprise | 0.069896 | 1.862384e-03 |
| 3 | 0,1 | is_amc | 0.068034 | -8.881784e-16 |
| 4 | 0,1 | is_bmo | 0.068034 | -8.881784e-16 |
| 5 | 0,1 | is_friday | 0.068034 | -8.881784e-16 |
| 6 | 0,1 | pre_ret_10d | 0.065582 | -2.451491e-03 |
| 7 | 0,1 | cpi_x_prevol5d | 0.063162 | -4.871555e-03 |
| 8 | 0,1 | is_monday | 0.060641 | -7.392425e-03 |
| 9 | 0,1 | vix_chg_10d_lag1 | 0.060394 | -7.639289e-03 |
| 10 | 0,1 | high_vix_regime | 0.057423 | -1.061060e-02 |
| 11 | 0,1 | cpi_x_surprise | 0.054169 | -1.386500e-02 |
| 12 | 0,1 | is_january | 0.051872 | -1.616123e-02 |
| 13 | 0,1 | investment_grade_option_adjusted_spread_bp | 0.050660 | -1.737398e-02 |
| 14 | 0,1 | investment_grade_option_adjusted_spread_pct | 0.050660 | -1.737398e-02 |
| 15 | 0,1 | macro_fedfunds | 0.050613 | -1.742084e-02 |
| 16 | 0,1 | high_rates_regime | 0.049974 | -1.805981e-02 |
| 17 | 0,1 | pre_vol_3d | 0.048157 | -1.987686e-02 |
| 18 | 0,1 | weekly_density | 0.048107 | -1.992688e-02 |
| 19 | 0,1 | high_density_week | 0.048107 | -1.992688e-02 |
| 20 | 0,1 | month | 0.046575 | -2.145911e-02 |
| 21 | 0,1 | baa_minus_aaa_bp | 0.044385 | -2.364825e-02 |
| 22 | 0,1 | baa_minus_aaa_pct | 0.044385 | -2.364825e-02 |
| 23 | 0,1 | mkt_ret_1d_lag1 | 0.041337 | -2.669715e-02 |
| 24 | 0,1 | quarter | 0.040292 | -2.774160e-02 |
| 36 | 0,3 | macro_cpi_yoy | 0.096439 | 2.172138e-03 |
| 37 | 0,3 | is_amc | 0.094267 | 6.383782e-16 |
| 38 | 0,3 | is_friday | 0.094267 | 6.383782e-16 |
| 39 | 0,3 | is_bmo | 0.094267 | 6.383782e-16 |
| 40 | 0,3 | pre_vol_10d | 0.093876 | -3.911947e-04 |
| 41 | 0,3 | is_monday | 0.093087 | -1.179961e-03 |
| 42 | 0,3 | macro_fedfunds | 0.090032 | -4.235049e-03 |
| 43 | 0,3 | rates_x_surprise | 0.089324 | -4.943540e-03 |
| 44 | 0,3 | cpi_x_prevol5d | 0.088516 | -5.751032e-03 |
| 45 | 0,3 | investment_grade_option_adjusted_spread_bp | 0.088365 | -5.902608e-03 |
| 46 | 0,3 | investment_grade_option_adjusted_spread_pct | 0.088365 | -5.902608e-03 |
| 47 | 0,3 | cpi_x_surprise | 0.080796 | -1.347118e-02 |
| 48 | 0,3 | pre_vol_3d | 0.080411 | -1.385620e-02 |
| 49 | 0,3 | high_rates_regime | 0.079143 | -1.512391e-02 |
| 50 | 0,3 | is_january | 0.078816 | -1.545143e-02 |
| 51 | 0,3 | high_vix_regime | 0.077928 | -1.633957e-02 |
| 52 | 0,3 | pre_ret_10d | 0.077158 | -1.710937e-02 |
| 53 | 0,3 | baa_minus_aaa_pct | 0.076349 | -1.791780e-02 |
| 54 | 0,3 | baa_minus_aaa_bp | 0.076349 | -1.791780e-02 |
| 55 | 0,3 | weekly_density | 0.074380 | -1.988693e-02 |
| 56 | 0,3 | high_density_week | 0.074380 | -1.988693e-02 |
| 57 | 0,3 | month | 0.068821 | -2.544574e-02 |
| 58 | 0,3 | mkt_ret_1d_lag1 | 0.067033 | -2.723425e-02 |
| 59 | 0,3 | quarter | 0.059743 | -3.452387e-02 |
| 60 | 0,3 | day_of_week | 0.054746 | -3.952102e-02 |
| 72 | 0,5 | is_amc | 0.121771 | 1.221245e-15 |
| 73 | 0,5 | is_bmo | 0.121771 | 1.221245e-15 |
| 74 | 0,5 | is_friday | 0.121771 | 1.221245e-15 |
| 75 | 0,5 | pre_vol_10d | 0.120651 | -1.120119e-03 |
| 76 | 0,5 | macro_fedfunds | 0.119332 | -2.438700e-03 |
| 77 | 0,5 | macro_cpi_yoy | 0.117592 | -4.178903e-03 |
| 78 | 0,5 | high_density_week | 0.116452 | -5.318717e-03 |
| 79 | 0,5 | weekly_density | 0.116452 | -5.318717e-03 |
| 80 | 0,5 | cpi_x_prevol5d | 0.115113 | -6.657469e-03 |
| 81 | 0,5 | is_monday | 0.114571 | -7.199965e-03 |
| 82 | 0,5 | investment_grade_option_adjusted_spread_bp | 0.113210 | -8.560419e-03 |
| 83 | 0,5 | investment_grade_option_adjusted_spread_pct | 0.113210 | -8.560419e-03 |
| 84 | 0,5 | high_vix_regime | 0.111899 | -9.871706e-03 |
| 85 | 0,5 | high_rates_regime | 0.111492 | -1.027892e-02 |
| 86 | 0,5 | baa_minus_aaa_bp | 0.110206 | -1.156436e-02 |
| 87 | 0,5 | baa_minus_aaa_pct | 0.110206 | -1.156436e-02 |
| 88 | 0,5 | is_january | 0.108703 | -1.306799e-02 |
| 89 | 0,5 | pre_ret_10d | 0.106843 | -1.492733e-02 |
| 90 | 0,5 | pre_vol_3d | 0.105928 | -1.584277e-02 |
| 91 | 0,5 | month | 0.101969 | -1.980198e-02 |
| 92 | 0,5 | rates_x_surprise | 0.097229 | -2.454192e-02 |
| 93 | 0,5 | mkt_ret_1d_lag1 | 0.097109 | -2.466205e-02 |
| 94 | 0,5 | cpi_x_surprise | 0.095834 | -2.593715e-02 |
| 95 | 0,5 | quarter | 0.092512 | -2.925895e-02 |
| 96 | 0,5 | vix_chg_10d_lag1 | 0.091427 | -3.034321e-02 |
=== Features selected by nested forward selection (frequency across outer folds) ===
| | feature | selected_in_folds | window |
|---|---|---|---|
| 0 | baa_minus_aaa_bp | 2 | 0,1 |
| 1 | pre_vol_10d | 2 | 0,1 |
| 2 | is_january | 1 | 0,1 |
| 3 | is_q4 | 1 | 0,1 |
| 4 | month | 1 | 0,1 |
| 5 | day_of_week | 1 | 0,1 |
| 6 | rates_x_surprise | 1 | 0,1 |
| 7 | moody_aaa_yield_pct | 1 | 0,1 |
| 8 | vix_x_surprise | 1 | 0,1 |
| 9 | macro_cpi_yoy | 1 | 0,1 |
| 10 | pre_vol_10d | 2 | 0,3 |
| 11 | is_q4 | 1 | 0,3 |
| 12 | is_january | 1 | 0,3 |
| 13 | day_of_week | 1 | 0,3 |
| 14 | rates_x_surprise | 1 | 0,3 |
| 15 | vix_x_surprise | 1 | 0,3 |
| 16 | moody_aaa_yield_pct | 1 | 0,3 |
| 17 | high_vix_regime | 1 | 0,3 |
| 18 | is_monday | 1 | 0,3 |
| 19 | pre_vol_10d | 2 | 0,5 |
| 20 | month | 1 | 0,5 |
| 21 | rates_x_surprise | 1 | 0,5 |
| 22 | is_january | 1 | 0,5 |
| 23 | baa_minus_aaa_pct | 1 | 0,5 |
| 24 | day_of_week | 1 | 0,5 |
| 25 | vix_x_surprise | 1 | 0,5 |
| 26 | baa_minus_aaa_bp | 1 | 0,5 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_growth_summary.csv
 - v12_growth_quick_marginal_gains.csv
 - v12_growth_selected_frequencies.csv
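The cross validated coefficient of determination reported above is ticker-aware: folds are built with `GroupKFold`, so every event for a given ticker lands in the same fold and the model is always scored on tickers it never saw in training. A minimal sketch of that split behaviour with toy data (the tickers and values below are illustrative, not from the study):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# Toy panel: 12 events across 4 tickers (made-up numbers).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "NVDA", "AMZN"] * 3,
    "surprise": rng.normal(size=12),
})
df["car"] = 0.5 * df["surprise"] + rng.normal(scale=0.1, size=12)

X, y, groups = df[["surprise"]], df["car"], df["ticker"]
gkf = GroupKFold(n_splits=4)
for tr, te in gkf.split(X, y, groups=groups):
    # No ticker ever appears in both the train and the test fold.
    assert set(groups.iloc[tr]).isdisjoint(set(groups.iloc[te]))
    mdl = LinearRegression().fit(X.iloc[tr], y.iloc[tr])
    print(sorted(set(groups.iloc[te])), round(mdl.score(X.iloc[te], y.iloc[te]), 3))
```

With 4 groups and 4 splits, each test fold holds exactly one ticker, which is why repeated events for the same name cannot leak information across folds.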
In [21]:
# === Compare features v1.2 vs v1.4 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2 (ticker-aware)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.4.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)):
m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull:
best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
# drop zero-variance predictors
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
"""Mean test R^2 using GroupKFold when possible; KFold fallback if too few groups."""
n_groups = int(pd.Series(groups).nunique())
if n_groups >= 2:
n_splits = min(max_folds, n_groups)
gkf = GroupKFold(n_splits=n_splits)
scores = []
mdl = LinearRegression()
for tr, te in gkf.split(X, y, groups=groups):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
# fallback: plain KFold
n = len(X)
if n < 3:
return np.nan
n_splits = min(3, n)
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
scores = []
mdl = LinearRegression()
for tr, te in kf.split(X, y):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.4.xlsx."
print("Testing files:", present)
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.4):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS (v1.4 - v1.2) ----------
pairs = []
for w in WINDOWS:
a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
b = res_df[(res_df["features_file"]=="features v1.4.xlsx") & (res_df["window"]==w)]
if not a.empty and not b.empty:
pairs.append({
"window": w,
"delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
"rows_used_v1.2": int(a["rows_used"].iloc[0]),
"rows_used_v1.4": int(b["rows_used"].iloc[0]),
"features_used_v1.2": int(a["features_used"].iloc[0]),
"features_used_v1.4": int(b["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs)
print("\nDeltas (v1.4 minus v1.2) — positive is good:")
display(deltas)
# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.4_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.4_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.4_results.csv")
print(" - v1.2_vs_v1.4_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.4.xlsx']
Merge audit:
| | features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.4.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 11 | CAR |
| 4 | features v1.4.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 11 | CAR |
| 5 | features v1.4.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 11 | CAR |
Results (v1.2 vs v1.4):
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 11 | 0.262463 | 0.193121 | -0.143365 | features v1.4.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 11 | 0.215955 | 0.142241 | -0.128261 | features v1.4.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 11 | 0.222797 | 0.149727 | -0.108542 | features v1.4.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (v1.2) | adjusted_r_squared (v1.4) | cross_validated_r_squared (v1.2) | cross_validated_r_squared (v1.4) | r_squared (v1.2) | r_squared (v1.4) |
|---|---|---|---|---|---|---|
| 0,1 | 0.194838 | 0.193121 | 0.068034 | -0.143365 | 0.245160 | 0.262463 |
| 0,3 | 0.148246 | 0.142241 | 0.094267 | -0.128261 | 0.201481 | 0.215955 |
| 0,5 | 0.162384 | 0.149727 | 0.121771 | -0.108542 | 0.214735 | 0.222797 |
Deltas (v1.4 minus v1.2) — positive is good:
| | window | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_v1.2 | rows_used_v1.4 | features_used_v1.2 | features_used_v1.4 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | -0.211399 | -0.001716 | 0.017302 | 129 | 129 | 8 | 11 |
| 1 | 0,3 | -0.222528 | -0.006006 | 0.014474 | 129 | 129 | 8 | 11 |
| 2 | 0,5 | -0.230313 | -0.012657 | 0.008063 | 129 | 129 | 8 | 11 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.4_results.csv
 - v1.2_vs_v1.4_comparison_table.csv
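One way to read these deltas: v1.4 raises raw R^2 in every window but lowers adjusted R^2, because the adjustment penalizes its 3 extra predictors. A quick check of the formula the cell applies, using the window 0,1 figures from the results table above (reproduced here as a sketch; small last-digit differences come from the table's rounding):

```python
# Same formula as in fit_and_score: adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    return 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)

# Window 0,1: v1.2 has 8 predictors, v1.4 has 11, both on 129 rows.
adj_v12 = adjusted_r_squared(0.245160, n=129, p=8)
adj_v14 = adjusted_r_squared(0.262463, n=129, p=11)
print(round(adj_v12, 6), round(adj_v14, 6))  # matches the table to rounding
# v1.4's higher raw R^2 (0.262 vs 0.245) does not survive the penalty
# for the 3 extra features, so its adjusted R^2 comes out lower.
```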
In [23]:
# === Compare features v1.2 vs features v1.5 (join on day0 + ticker) ===
# Metrics: coefficient of determination, adjusted coefficient of determination,
# cross validated coefficient of determination (ticker-aware)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)):
m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull:
best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
"""Mean test coefficient of determination using group folds when possible; row folds fallback if too few groups."""
n_groups = int(pd.Series(groups).nunique())
if n_groups >= 2:
n_splits = min(max_folds, n_groups)
gkf = GroupKFold(n_splits=n_splits)
scores = []
mdl = LinearRegression()
for tr, te in gkf.split(X, y, groups=groups):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
# fallback to ordinary KFold on rows
n = len(X)
if n < 3:
return np.nan
n_splits = min(3, n)
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
scores = []
mdl = LinearRegression()
for tr, te in kf.split(X, y):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.5.xlsx."
print("Testing files:", present)
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS (v1.5 minus v1.2) ----------
pairs = []
for w in WINDOWS:
a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
b = res_df[(res_df["features_file"]=="features v1.5.xlsx") & (res_df["window"]==w)]
if not a.empty and not b.empty:
pairs.append({
"window": w,
"delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
"rows_used_v1.2": int(a["rows_used"].iloc[0]),
"rows_used_v1.5": int(b["rows_used"].iloc[0]),
"features_used_v1.2": int(a["features_used"].iloc[0]),
"features_used_v1.5": int(b["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs)
print("\nDeltas (v1.5 minus v1.2) — positive is good:")
display(deltas)
# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.5_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.5_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.5_results.csv")
print(" - v1.2_vs_v1.5_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx']
Merge audit:
| | features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.5.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
| 4 | features v1.5.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
| 5 | features v1.5.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
Results (v1.2 vs v1.5):
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 1 | 0.082047 | 0.074819 | -0.060921 | features v1.5.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 1 | 0.056217 | 0.048786 | -0.051390 | features v1.5.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 1 | 0.059075 | 0.051666 | -0.034691 | features v1.5.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (v1.2) | adjusted_r_squared (v1.5) | cross_validated_r_squared (v1.2) | cross_validated_r_squared (v1.5) | r_squared (v1.2) | r_squared (v1.5) |
|---|---|---|---|---|---|---|
| 0,1 | 0.194838 | 0.074819 | 0.068034 | -0.060921 | 0.245160 | 0.082047 |
| 0,3 | 0.148246 | 0.048786 | 0.094267 | -0.051390 | 0.201481 | 0.056217 |
| 0,5 | 0.162384 | 0.051666 | 0.121771 | -0.034691 | 0.214735 | 0.059075 |
Deltas (v1.5 minus v1.2) — positive is good:
| | window | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_v1.2 | rows_used_v1.5 | features_used_v1.2 | features_used_v1.5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | -0.128955 | -0.120019 | -0.163114 | 129 | 129 | 8 | 1 |
| 1 | 0,3 | -0.145657 | -0.099461 | -0.145264 | 129 | 129 | 8 | 1 |
| 2 | 0,5 | -0.156462 | -0.110718 | -0.155660 | 129 | 129 | 8 | 1 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.5_results.csv
 - v1.2_vs_v1.5_comparison_table.csv
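Both comparison runs report negative cross_validated_r_squared for the challenger feature sets. That is expected behaviour of the metric, not a computation error: each fold is scored as 1 - SS_res/SS_tot against the held-out fold's own mean, so the score goes below zero whenever the model predicts worse out-of-fold than simply predicting that mean. A minimal illustration with made-up numbers:

```python
import numpy as np

y_true = np.array([0.01, -0.02, 0.03, -0.01])   # held-out fold targets (toy)
y_pred = np.array([0.05, 0.04, -0.03, 0.02])    # poor out-of-fold predictions

# Same fold-level computation as in safe_grouped_cv_r2:
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2 < 0)  # True: worse than the fold-mean baseline
```

So a negative CV score for v1.4/v1.5 means those feature sets generalise worse across tickers than a constant prediction of the fold mean, even though their in-sample R^2 looks respectable.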
In [25]:
# === Compare features v1.2 vs v1.5 vs v1.6 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx", "features v1.6.xlsx"]
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = (b / name)
if p.exists(): return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# pick most date-like
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull: best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
if n_groups >= 2:
n_splits = min(max_folds, n_groups)
gkf = GroupKFold(n_splits=n_splits)
scores = []
for tr, te in gkf.split(X, y, groups=groups):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
# fallback to KFold on rows
n = len(X)
if n < 3: return np.nan
kf = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
scores = []
for tr, te in kf.split(X, y):
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
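A quick self-contained sanity check of the grouped splitting this helper relies on (synthetic data; the ticker names here are illustrative, not from the workbook): `GroupKFold` never places the same group on both sides of a split, which is what makes the cross-validated R^2 above "grouped" / ticker-aware.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Synthetic data: 3 tickers, 4 observations each.
_rng = np.random.default_rng(0)
_Xd = pd.DataFrame({"f1": _rng.normal(size=12)})
_yd = pd.Series(_rng.normal(size=12))
_gd = pd.Series(["AAA", "BBB", "CCC"] * 4)

for _tr, _te in GroupKFold(n_splits=3).split(_Xd, _yd, groups=_gd):
    # No ticker appears in both the training and the test side of a split.
    assert set(_gd.iloc[_tr]).isdisjoint(set(_gd.iloc[_te]))
```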
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("event_study.xlsx not found.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find any of: features v1.2.xlsx, v1.5.xlsx, v1.6.xlsx"
print("Testing files:", present)
merge_audit, results = [], []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5 vs v1.6):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
for alt in ["features v1.5.xlsx", "features v1.6.xlsx"]:
comp = res_df[(res_df["features_file"]==alt) & (res_df["window"]==w)]
if not base.empty and not comp.empty:
pairs.append({
"window": w,
"model_vs_v1.2": alt,
"delta_cross_validated_r_squared": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
"rows_used_base": int(base["rows_used"].iloc[0]),
"rows_used_alt": int(comp["rows_used"].iloc[0]),
"features_used_base": int(base["features_used"].iloc[0]),
"features_used_alt": int(comp["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs).sort_values(["window","model_vs_v1.2"]).reset_index(drop=True)
print("\nDeltas vs v1.2 — positive is good:")
display(deltas)
# ---------- SAVE ----------
out_dir = evt_path.parent  # reuse the event-study path located earlier
res_df.to_csv(out_dir / "v1.2_v1.5_v1.6_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_v1.5_v1.6_comparison_table.csv")
if pairs:
deltas.to_csv(out_dir / "v1.2_v1.5_v1.6_deltas_vs_v12.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_v1.5_v1.6_results.csv")
print(" - v1.2_v1.5_v1.6_comparison_table.csv")
print(" - v1.2_v1.5_v1.6_deltas_vs_v12.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx', 'features v1.6.xlsx']

Merge audit:
| features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.5.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
| 4 | features v1.5.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
| 5 | features v1.5.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 1 | CAR |
| 6 | features v1.6.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
| 7 | features v1.6.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
| 8 | features v1.6.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
Results (v1.2 vs v1.5 vs v1.6):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window | |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 1 | 0.082047 | 0.074819 | -0.060921 | features v1.5.xlsx | features | 0,1 |
| 2 | 129 | 7 | 0.171406 | 0.123471 | 0.029867 | features v1.6.xlsx | features | 0,1 |
| 3 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 4 | 129 | 1 | 0.056217 | 0.048786 | -0.051390 | features v1.5.xlsx | features | 0,3 |
| 5 | 129 | 7 | 0.149466 | 0.100261 | 0.054352 | features v1.6.xlsx | features | 0,3 |
| 6 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 7 | 129 | 1 | 0.059075 | 0.051666 | -0.034691 | features v1.5.xlsx | features | 0,5 |
| 8 | 129 | 7 | 0.161872 | 0.113385 | 0.075321 | features v1.6.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| adjusted_r_squared | cross_validated_r_squared | r_squared | |||||||
|---|---|---|---|---|---|---|---|---|---|
| features_file | features v1.2.xlsx | features v1.5.xlsx | features v1.6.xlsx | features v1.2.xlsx | features v1.5.xlsx | features v1.6.xlsx | features v1.2.xlsx | features v1.5.xlsx | features v1.6.xlsx |
| window | |||||||||
| 0,1 | 0.194838 | 0.074819 | 0.123471 | 0.068034 | -0.060921 | 0.029867 | 0.245160 | 0.082047 | 0.171406 |
| 0,3 | 0.148246 | 0.048786 | 0.100261 | 0.094267 | -0.051390 | 0.054352 | 0.201481 | 0.056217 | 0.149466 |
| 0,5 | 0.162384 | 0.051666 | 0.113385 | 0.121771 | -0.034691 | 0.075321 | 0.214735 | 0.059075 | 0.161872 |
Deltas vs v1.2 — positive is good:
| window | model_vs_v1.2 | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_base | rows_used_alt | features_used_base | features_used_alt | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | features v1.5.xlsx | -0.128955 | -0.120019 | -0.163114 | 129 | 129 | 8 | 1 |
| 1 | 0,1 | features v1.6.xlsx | -0.038166 | -0.071367 | -0.073754 | 129 | 129 | 8 | 7 |
| 2 | 0,3 | features v1.5.xlsx | -0.145657 | -0.099461 | -0.145264 | 129 | 129 | 8 | 1 |
| 3 | 0,3 | features v1.6.xlsx | -0.039915 | -0.047985 | -0.052015 | 129 | 129 | 8 | 7 |
| 4 | 0,5 | features v1.5.xlsx | -0.156462 | -0.110718 | -0.155660 | 129 | 129 | 8 | 1 |
| 5 | 0,5 | features v1.6.xlsx | -0.046450 | -0.048998 | -0.052863 | 129 | 129 | 8 | 7 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_v1.5_v1.6_results.csv
 - v1.2_v1.5_v1.6_comparison_table.csv
 - v1.2_v1.5_v1.6_deltas_vs_v12.csv
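One thing worth noting in the results above: the cross-validated R^2 for v1.5 is negative. Each fold's score is computed against the test fold's own mean, so a model whose out-of-sample predictions are worse than simply predicting that mean scores below zero. A minimal illustration with made-up numbers:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # out-of-sample predictions worse than the fold mean

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 5.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 2.0
print(1 - ss_res / ss_tot)                      # -1.5
```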
In [27]:
# === Grow v1.2 by adding features from v3, while ALWAYS including EPS surprise pct ===
# Outputs:
# - summary (baseline vs improved) per window
# - quick one-at-a-time gains for all v3 candidates (vs baseline)
# - features selected by nested forward selection (with fold frequencies)
# - CSVs saved next to event_study.xlsx
#
# If needed, install dependencies first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE = "features v1.2.xlsx" # baseline feature set
EPS_FILE = "features v1.5.xlsx" # has EPS surprise pct (used if base lacks it)
POOL_FILE = "features v3.xlsx" # candidate features to try adding
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_OUTER_FOLDS = 5 # grouped by ticker
MAX_INNER_FOLDS = 3 # grouped by ticker on the training fold
MAX_FEATURES_TO_ADD = 5 # cap number of added v3 features
MIN_GAIN = 0.01 # require at least +0.01 CV R^2 (by ticker) to add
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name): continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)): m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_group_cv_scores(X, y, groups, max_splits=5, min_splits=2):
"""Return mean test R^2 and the splitter used. Uses GroupKFold when possible; KFold fallback."""
n_groups = int(pd.Series(groups).nunique())
if n_groups >= min_splits:
n_splits = min(max_splits, n_groups)
splitter = GroupKFold(n_splits=n_splits)
model = LinearRegression()
scores = []
for tr, te in splitter.split(X, y, groups=groups):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_hat)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores)), splitter
# fallback to KFold
n = len(X)
if n < 3: return np.nan, None
splitter = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
model = LinearRegression()
scores = []
for tr, te in splitter.split(X, y):
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_hat)**2); ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores)), splitter
def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
if X.shape[1] == 0: return np.nan, np.nan
mdl = LinearRegression().fit(X.values, y.values)
r2 = float(mdl.score(X.values, y.values))
n, p = len(y), X.shape[1]
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
return r2, adj
def find_eps_column(df: pd.DataFrame):
# Look for something like "EPS surprise pct", case-insensitive, flexible wording
pats = [
r"eps.*surpris.*(pct|percent|%)",
r"earnings.*surpris.*(pct|percent|%)",
r"eps[_\s]*surpris", # fallback
r"surpris[_\s]*(pct|percent|%)"
]
nums = df.select_dtypes(include=[np.number]).columns
for pat in pats:
cands = [c for c in df.columns if re.search(pat, str(c), flags=re.IGNORECASE)]
cands = [c for c in cands if c in nums]
if cands:
return cands[0]
return None
# ---------- LOAD BOOKS ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
base_path = find_file(BASE_FILE)
base_book = pd.read_excel(base_path, sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw = base_book[base_sheet].copy()
base_day0 = find_day0_column(base_raw)
base_tick = find_ticker_column(base_raw)
base_grp, base_num_cols = aggregate_features(base_raw, base_day0, base_tick)
pool_path = find_file(POOL_FILE)
pool_book = pd.read_excel(pool_path, sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw = pool_book[pool_sheet].copy()
pool_day0 = find_day0_column(pool_raw)
pool_tick = find_ticker_column(pool_raw)
pool_grp, pool_num_cols = aggregate_features(pool_raw, pool_day0, pool_tick)
# EPS column: try base first, else v1.5, else v3/pool
eps_col = find_eps_column(base_grp)
if eps_col is None:
try:
eps_path = find_file(EPS_FILE)
eps_book = pd.read_excel(eps_path, sheet_name=None, engine="openpyxl")
eps_sheet = choose_features_sheet(eps_book)
eps_raw = eps_book[eps_sheet].copy()
eps_day0 = find_day0_column(eps_raw); eps_tick = find_ticker_column(eps_raw)
eps_grp, _ = aggregate_features(eps_raw, eps_day0, eps_tick)
eps_col = find_eps_column(eps_grp)
if eps_col is None:
raise ValueError("Could not find EPS surprise pct in v1.5.")
except Exception:
eps_col = find_eps_column(pool_grp)
if eps_col is None:
raise ValueError("Could not find an EPS surprise pct column in base, v1.5, or v3.")
eps_grp = pool_grp[["__day0__","__ticker__", eps_col]].copy()
else:
eps_grp = base_grp[["__day0__","__ticker__", eps_col]].copy()
# Candidate features from v3 (exclude anything already in base or the EPS column)
candidate_cols = [c for c in pool_num_cols if c not in set(base_num_cols) and c != eps_col]
# ---------- WORK PER WINDOW ----------
all_quick = []
all_selected = []
all_summary = []
for window in WINDOWS:
esheet = win_map.get(window)
if esheet is None:
print(f"Skip window {window}: event sheet not found.")
continue
df_evt = evt_book[esheet].copy()
evt_day0 = find_day0_column(df_evt); evt_tick = find_ticker_column(df_evt); y_col = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[evt_day0])
evt["__ticker__"] = normalize_ticker(evt[evt_tick])
evt = evt.dropna(subset=["__day0__","__ticker__", y_col]).drop_duplicates(subset=["__day0__","__ticker__"])
# Merge base, EPS, and pool keys
merged_base = base_grp.merge(eps_grp[["__day0__","__ticker__", eps_col]], on=["__day0__","__ticker__"], how="left")
merged_base = merged_base.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
merged_pool = pool_grp[["__day0__","__ticker__"] + candidate_cols]
merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")
# Build BASE = v1.2 features + EPS column
base_plus_eps_cols = list(dict.fromkeys(base_num_cols + [eps_col])) # keep order, drop dup
X_base = build_X(merged, base_plus_eps_cols, y_col)
y = merged[y_col].astype(float)
groups = merged["__ticker__"]
# Baseline scores
base_cv, outer_splitter = safe_group_cv_scores(X_base, y, groups, max_splits=MAX_OUTER_FOLDS, min_splits=2)
base_r2, base_adj = in_sample_and_adjusted(X_base, y)
# ---- QUICK ONE-AT-A-TIME GAINS (add each v3 candidate to base+EPS) ----
quick_rows = []
for c in candidate_cols:
if c not in merged.columns: continue
Xt = pd.concat([X_base, merged[[c]]], axis=1)
data = pd.concat([y, Xt], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
if X_c.shape[1] == 0 or len(y_c) < 10: continue
cv_r2, _ = safe_group_cv_scores(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
quick_rows.append({"window": window, "feature": c, "cv_with_feature": cv_r2, "delta": cv_r2 - base_cv})
quick_df = pd.DataFrame(quick_rows).sort_values(["window","delta"], ascending=[True, False]).reset_index(drop=True)
all_quick.append(quick_df)
# ---- NESTED FORWARD SELECTION (start from base+EPS, add v3 features if they help) ----
# Build outer splits
if outer_splitter is None:
splits = []
elif isinstance(outer_splitter, GroupKFold):
splits = list(outer_splitter.split(X_base, y, groups=groups))
else:
splits = list(outer_splitter.split(X_base, y))
outer_scores = []
fold_selected = []
for tr, te in splits:
Xb_tr, Xb_te = X_base.iloc[tr], X_base.iloc[te]
y_tr, y_te = y.iloc[tr], y.iloc[te]
groups_tr = groups.iloc[tr]
def inner_cv_score(Xt, yt):
return safe_group_cv_scores(Xt, yt, groups_tr.loc[Xt.index], max_splits=MAX_INNER_FOLDS, min_splits=2)[0]
# starting point = base+EPS on training data
data_tr = pd.concat([y_tr, Xb_tr], axis=1).dropna()
y_tr_c, X_tr_c = data_tr.iloc[:,0], data_tr.iloc[:,1:]
base_inner = inner_cv_score(X_tr_c, y_tr_c)
avail = [c for c in candidate_cols if c in merged.columns]
chosen = []
for _ in range(MAX_FEATURES_TO_ADD):
best_gain, best_feat = 0.0, None
for c in avail:
col = merged.loc[Xb_tr.index, c]
Xt = pd.concat([X_tr_c, col], axis=1).dropna()
yt = y_tr.loc[Xt.index]
if Xt.shape[1] == 0 or len(yt) < 10:
continue
score = inner_cv_score(Xt, yt)
gain = score - base_inner
if gain > best_gain:
best_gain, best_feat = gain, c
if best_feat is None or best_gain < MIN_GAIN:
break
# accept the feature
chosen.append(best_feat)
avail.remove(best_feat)
X_tr_c = pd.concat([X_tr_c, merged.loc[Xb_tr.index, [best_feat]]], axis=1).dropna()
y_tr_c = y_tr.loc[X_tr_c.index]
base_inner = inner_cv_score(X_tr_c, y_tr_c)
fold_selected.append(chosen)
# evaluate on the outer test fold
X_te = Xb_te.copy()
if chosen:
X_te = pd.concat([X_te, merged.loc[Xb_te.index, chosen]], axis=1)
data_te = pd.concat([y_te, X_te], axis=1).dropna()
y_te_c, X_te_c = data_te.iloc[:,0], data_te.iloc[:,1:]
if X_te_c.shape[1] == 0 or len(y_te_c) < 2:
outer_scores.append(np.nan)
else:
# use training-fitted model on the final training matrix
mdl = LinearRegression().fit(X_tr_c.values, y_tr_c.values)
y_hat = mdl.predict(X_te_c.values)
ss_res = np.sum((y_te_c.values - y_hat)**2)
ss_tot = np.sum((y_te_c.values - np.mean(y_te_c.values))**2)
outer_scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
# count selections across folds
flat = [f for sub in fold_selected for f in sub]
freq = pd.Series(flat).value_counts().rename("selected_in_folds").to_frame()
freq["window"] = window
freq = freq.reset_index().rename(columns={"index":"feature"})
all_selected.append(freq)
# union of features picked in at least half the folds
keep_union = []
if not freq.empty and len(splits) > 0:
half = max(1, int(np.ceil(len(splits)/2)))
keep_union = freq.loc[freq["selected_in_folds"] >= half, "feature"].tolist()
# evaluate union model (base+EPS + union additions) on full sample CV
X_full = X_base.copy()
if keep_union:
X_full = pd.concat([X_full, merged[keep_union]], axis=1)
data_full = pd.concat([y, X_full], axis=1).dropna()
y_full, X_full_c = data_full.iloc[:,0], data_full.iloc[:,1:]
full_cv, _ = safe_group_cv_scores(X_full_c, y_full, groups.loc[X_full_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
full_r2, full_adj = in_sample_and_adjusted(X_full_c, y_full)
all_summary.append({
"window": window,
"baseline_cross_validated_r_squared": base_cv,
"baseline_r_squared": base_r2,
"baseline_adjusted_r_squared": base_adj,
"selected_union_features": ", ".join(keep_union) if keep_union else "",
"n_selected_union": len(keep_union),
"union_model_cross_validated_r_squared": full_cv,
"union_model_r_squared": full_r2,
"union_model_adjusted_r_squared": full_adj,
"nested_forward_mean_test_cross_validated_r_squared": float(np.nanmean(outer_scores)) if outer_scores else np.nan
})
# ---------- REPORT ----------
quick_all = pd.concat(all_quick, ignore_index=True) if all_quick else pd.DataFrame()
selected_all= pd.concat(all_selected, ignore_index=True) if all_selected else pd.DataFrame()
summary = pd.DataFrame(all_summary)
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)
print("\n=== Summary: baseline (v1.2 + EPS) vs improved (added v3 features) ===")
display(summary.sort_values("window"))
if not quick_all.empty:
print("\n=== Quick marginal gains (top 25 per window) — delta vs baseline CV R^2 ===")
display(quick_all.sort_values(["window","delta"], ascending=[True, False]).groupby("window").head(25))
if not selected_all.empty:
print("\n=== Features selected by nested forward selection (freq across outer folds) ===")
display(selected_all.sort_values(["window","selected_in_folds"], ascending=[True, False]))
# ---------- SAVE ----------
out_dir = evt_path.parent
summary.to_csv(out_dir / "v12_plus_EPS_growth_summary.csv", index=False)
if not quick_all.empty:
quick_all.to_csv(out_dir / "v12_plus_EPS_quick_gains.csv", index=False)
if not selected_all.empty:
selected_all.to_csv(out_dir / "v12_plus_EPS_selected_freq.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_plus_EPS_growth_summary.csv")
print(" - v12_plus_EPS_quick_gains.csv")
print(" - v12_plus_EPS_selected_freq.csv")
=== Summary: baseline (v1.2 + EPS) vs improved (added v3 features) ===
| window | baseline_cross_validated_r_squared | baseline_r_squared | baseline_adjusted_r_squared | selected_union_features | n_selected_union | union_model_cross_validated_r_squared | union_model_r_squared | union_model_adjusted_r_squared | nested_forward_mean_test_cross_validated_r_squared | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 0.029867 | 0.171406 | 0.123471 | pre_vol_10d | 1 | 0.032993 | 0.184116 | 0.129723 | -0.221901 |
| 1 | 0,3 | 0.054352 | 0.149466 | 0.100261 | pre_vol_10d | 1 | 0.053016 | 0.158257 | 0.102140 | -0.207395 |
| 2 | 0,5 | 0.075321 | 0.161872 | 0.113385 | pre_vol_10d | 1 | 0.075510 | 0.166639 | 0.111082 | -0.199900 |
=== Quick marginal gains (top 25 per window) — delta vs baseline CV R^2 ===
| window | feature | cv_with_feature | delta | |
|---|---|---|---|---|
| 0 | 0,1 | cpi_x_prevol5d | 0.043880 | 1.401278e-02 |
| 1 | 0,1 | pre_vol_10d | 0.032993 | 3.125374e-03 |
| 2 | 0,1 | macro_fedfunds | 0.032806 | 2.938385e-03 |
| 3 | 0,1 | is_friday | 0.029867 | -5.932754e-16 |
| 4 | 0,1 | is_amc | 0.029867 | -5.932754e-16 |
| 5 | 0,1 | is_bmo | 0.029867 | -5.932754e-16 |
| 6 | 0,1 | high_vix_regime | 0.024420 | -5.447331e-03 |
| 7 | 0,1 | pre_ret_10d | 0.023300 | -6.567111e-03 |
| 8 | 0,1 | rates_x_surprise | 0.021839 | -8.028273e-03 |
| 9 | 0,1 | is_monday | 0.021704 | -8.162976e-03 |
| 10 | 0,1 | is_january | 0.021129 | -8.738139e-03 |
| 11 | 0,1 | pre_vol_3d | 0.017442 | -1.242591e-02 |
| 12 | 0,1 | macro_cpi_yoy | 0.013081 | -1.678638e-02 |
| 13 | 0,1 | high_density_week | 0.008183 | -2.168488e-02 |
| 14 | 0,1 | weekly_density | 0.008183 | -2.168488e-02 |
| 15 | 0,1 | vix_chg_10d_lag1 | 0.005445 | -2.442203e-02 |
| 16 | 0,1 | mkt_ret_10d_lag1 | 0.002847 | -2.702007e-02 |
| 17 | 0,1 | mkt_ret_1d_lag1 | 0.001018 | -2.884926e-02 |
| 18 | 0,1 | high_rates_regime | -0.000556 | -3.042343e-02 |
| 19 | 0,1 | cpi_x_surprise | -0.002427 | -3.229439e-02 |
| 20 | 0,1 | month | -0.015157 | -4.502429e-02 |
| 21 | 0,1 | quarter | -0.021307 | -5.117470e-02 |
| 22 | 0,1 | vix_x_prevol5d | -0.037740 | -6.760706e-02 |
| 23 | 0,1 | vix_x_surprise | -0.041374 | -7.124191e-02 |
| 24 | 0,1 | day_of_week | -0.042865 | -7.273295e-02 |
| 36 | 0,3 | is_monday | 0.060106 | 5.753802e-03 |
| 37 | 0,3 | is_friday | 0.054352 | -9.228729e-16 |
| 38 | 0,3 | is_amc | 0.054352 | -9.228729e-16 |
| 39 | 0,3 | is_bmo | 0.054352 | -9.228729e-16 |
| 40 | 0,3 | pre_vol_10d | 0.053016 | -1.336237e-03 |
| 41 | 0,3 | macro_fedfunds | 0.052784 | -1.568323e-03 |
| 42 | 0,3 | cpi_x_prevol5d | 0.051568 | -2.784127e-03 |
| 43 | 0,3 | is_january | 0.046472 | -7.880373e-03 |
| 44 | 0,3 | pre_vol_3d | 0.045408 | -8.943739e-03 |
| 45 | 0,3 | high_vix_regime | 0.044154 | -1.019809e-02 |
| 46 | 0,3 | rates_x_surprise | 0.040254 | -1.409774e-02 |
| 47 | 0,3 | high_rates_regime | 0.032729 | -2.162309e-02 |
| 48 | 0,3 | macro_cpi_yoy | 0.031711 | -2.264069e-02 |
| 49 | 0,3 | pre_ret_10d | 0.030065 | -2.428709e-02 |
| 50 | 0,3 | cpi_x_surprise | 0.025314 | -2.903777e-02 |
| 51 | 0,3 | mkt_ret_1d_lag1 | 0.023783 | -3.056916e-02 |
| 52 | 0,3 | high_density_week | 0.011312 | -4.304005e-02 |
| 53 | 0,3 | weekly_density | 0.011312 | -4.304005e-02 |
| 54 | 0,3 | day_of_week | 0.009372 | -4.498044e-02 |
| 55 | 0,3 | month | 0.008169 | -4.618267e-02 |
| 56 | 0,3 | baa_minus_aaa_bp | 0.002811 | -5.154114e-02 |
| 57 | 0,3 | baa_minus_aaa_pct | 0.002811 | -5.154114e-02 |
| 58 | 0,3 | mkt_ret_10d_lag1 | 0.002333 | -5.201875e-02 |
| 59 | 0,3 | quarter | -0.001250 | -5.560216e-02 |
| 60 | 0,3 | investment_grade_option_adjusted_spread_bp | -0.006374 | -6.072564e-02 |
| 72 | 0,5 | cpi_x_prevol5d | 0.077828 | 2.507406e-03 |
| 73 | 0,5 | is_monday | 0.077687 | 2.365892e-03 |
| 74 | 0,5 | pre_vol_10d | 0.075510 | 1.890622e-04 |
| 75 | 0,5 | is_friday | 0.075321 | -5.551115e-16 |
| 76 | 0,5 | is_amc | 0.075321 | -5.551115e-16 |
| 77 | 0,5 | is_bmo | 0.075321 | -5.551115e-16 |
| 78 | 0,5 | macro_fedfunds | 0.072634 | -2.687242e-03 |
| 79 | 0,5 | high_vix_regime | 0.070251 | -5.070044e-03 |
| 80 | 0,5 | is_january | 0.067968 | -7.352718e-03 |
| 81 | 0,5 | pre_vol_3d | 0.067084 | -8.236664e-03 |
| 82 | 0,5 | high_rates_regime | 0.064595 | -1.072590e-02 |
| 83 | 0,5 | macro_cpi_yoy | 0.056270 | -1.905092e-02 |
| 84 | 0,5 | pre_ret_10d | 0.054410 | -2.091083e-02 |
| 85 | 0,5 | mkt_ret_1d_lag1 | 0.051872 | -2.344909e-02 |
| 86 | 0,5 | high_density_week | 0.047707 | -2.761439e-02 |
| 87 | 0,5 | weekly_density | 0.047707 | -2.761439e-02 |
| 88 | 0,5 | rates_x_surprise | 0.042411 | -3.291045e-02 |
| 89 | 0,5 | month | 0.037916 | -3.740507e-02 |
| 90 | 0,5 | cpi_x_surprise | 0.033743 | -4.157764e-02 |
| 91 | 0,5 | baa_minus_aaa_bp | 0.033658 | -4.166307e-02 |
| 92 | 0,5 | baa_minus_aaa_pct | 0.033658 | -4.166307e-02 |
| 93 | 0,5 | day_of_week | 0.028835 | -4.648578e-02 |
| 94 | 0,5 | quarter | 0.027851 | -4.746982e-02 |
| 95 | 0,5 | vix_chg_10d_lag1 | 0.027678 | -4.764274e-02 |
| 96 | 0,5 | mkt_ret_10d_lag1 | 0.024830 | -5.049113e-02 |
=== Features selected by nested forward selection (freq across outer folds) ===
| feature | selected_in_folds | window | |
|---|---|---|---|
| 0 | pre_vol_10d | 2 | 0,1 |
| 1 | is_q4 | 1 | 0,1 |
| 2 | is_january | 1 | 0,1 |
| 3 | month | 1 | 0,1 |
| 4 | baa_minus_aaa_pct | 1 | 0,1 |
| 5 | day_of_week | 1 | 0,1 |
| 6 | cpi_x_prevol5d | 1 | 0,1 |
| 7 | weekly_density | 1 | 0,1 |
| 8 | vix_x_surprise | 1 | 0,1 |
| 9 | rates_x_surprise | 1 | 0,1 |
| 10 | pre_vol_10d | 2 | 0,3 |
| 11 | is_q4 | 1 | 0,3 |
| 12 | is_january | 1 | 0,3 |
| 13 | day_of_week | 1 | 0,3 |
| 14 | vix_x_surprise | 1 | 0,3 |
| 15 | weekly_density | 1 | 0,3 |
| 16 | cpi_x_surprise | 1 | 0,3 |
| 17 | pre_vol_10d | 2 | 0,5 |
| 18 | month | 1 | 0,5 |
| 19 | is_january | 1 | 0,5 |
| 20 | baa_minus_aaa_bp | 1 | 0,5 |
| 21 | day_of_week | 1 | 0,5 |
| 22 | rates_x_surprise | 1 | 0,5 |
| 23 | cpi_x_prevol5d | 1 | 0,5 |
| 24 | high_density_week | 1 | 0,5 |
| 25 | vix_x_surprise | 1 | 0,5 |
| 26 | high_vix_regime | 1 | 0,5 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_plus_EPS_growth_summary.csv
 - v12_plus_EPS_quick_gains.csv
 - v12_plus_EPS_selected_freq.csv
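The cell below joins each features workbook to the event-study CARs on (day0, ticker) after normalizing both keys. A minimal sketch of that two-key inner join, on synthetic rows (the key names mirror the helpers below; the values are purely illustrative):

```python
# Two-key inner join as used below: normalize the date and ticker keys,
# then merge features onto event CARs on (day0, ticker). Data is synthetic.
import pandas as pd

feats = pd.DataFrame({
    "day0":   ["2024-01-25", "2024-04-30"],
    "ticker": ["aapl ", "MSFT"],          # messy casing/whitespace on purpose
    "pre_vol_10d": [0.012, 0.020],
})
events = pd.DataFrame({
    "day0":   ["2024-01-25", "2024-04-30", "2024-07-30"],
    "ticker": ["AAPL", "MSFT", "MSFT"],
    "CAR":    [0.031, -0.012, 0.005],
})

for df in (feats, events):
    df["__day0__"] = pd.to_datetime(df["day0"]).dt.normalize()
    df["__ticker__"] = df["ticker"].astype(str).str.strip().str.upper()

merged = feats.merge(events[["__day0__", "__ticker__", "CAR"]],
                     on=["__day0__", "__ticker__"], how="inner")
print(len(merged))  # only (day0, ticker) pairs present on both sides survive
```

The normalization step matters: without stripping and upper-casing, `"aapl "` would not match `"AAPL"` and the row would silently drop out of the inner join.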
In [31]:
# === Compare features v1.2 vs features v1.5 (join on day0 + ticker) ===
# Metrics: R^2, adjusted R^2, cross-validated R^2 (grouped by ticker)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
return None
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
m = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for name in book.keys():
if is_readme_sheet(name):
continue
for w, pat in pats.items():
if m[w] is None and pat.search(str(name)):
m[w] = name
return m
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, best_nonnull = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > best_nonnull:
best, best_nonnull = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series) -> pd.Series:
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series) -> pd.Series:
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
"""Mean test coefficient of determination: group folds when possible; row-wise KFold fallback if too few groups."""
n_groups = int(pd.Series(groups).nunique())
if n_groups >= 2:
splits = GroupKFold(n_splits=min(max_folds, n_groups)).split(X, y, groups=groups)
else:
# fallback to ordinary KFold on rows
n = len(X)
if n < 3:
return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
mdl = LinearRegression()
scores = []
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_pred = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.5.xlsx."
print("Testing files:", present)
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS (v1.5 minus v1.2) ----------
pairs = []
for w in WINDOWS:
a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
b = res_df[(res_df["features_file"]=="features v1.5.xlsx") & (res_df["window"]==w)]
if not a.empty and not b.empty:
pairs.append({
"window": w,
"delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
"rows_used_v1.2": int(a["rows_used"].iloc[0]),
"rows_used_v1.5": int(b["rows_used"].iloc[0]),
"features_used_v1.2": int(a["features_used"].iloc[0]),
"features_used_v1.5": int(b["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs)
print("\nDeltas (v1.5 minus v1.2) — positive is good:")
display(deltas)
# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.5_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.5_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.5_results.csv")
print(" - v1.2_vs_v1.5_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx']
Merge audit:
| features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.5.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 4 | CAR |
| 4 | features v1.5.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 4 | CAR |
| 5 | features v1.5.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 4 | CAR |
Results (v1.2 vs v1.5):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window | |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 4 | 0.098411 | 0.069328 | -0.093031 | features v1.5.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 4 | 0.067332 | 0.037246 | -0.095339 | features v1.5.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 4 | 0.064792 | 0.034624 | -0.095392 | features v1.5.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| adjusted_r_squared | cross_validated_r_squared | r_squared | ||||
|---|---|---|---|---|---|---|
| features_file | features v1.2.xlsx | features v1.5.xlsx | features v1.2.xlsx | features v1.5.xlsx | features v1.2.xlsx | features v1.5.xlsx |
| window | ||||||
| 0,1 | 0.194838 | 0.069328 | 0.068034 | -0.093031 | 0.245160 | 0.098411 |
| 0,3 | 0.148246 | 0.037246 | 0.094267 | -0.095339 | 0.201481 | 0.067332 |
| 0,5 | 0.162384 | 0.034624 | 0.121771 | -0.095392 | 0.214735 | 0.064792 |
Deltas (v1.5 minus v1.2) — positive is good:
| window | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_v1.2 | rows_used_v1.5 | features_used_v1.2 | features_used_v1.5 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | -0.161065 | -0.12551 | -0.146749 | 129 | 129 | 8 | 4 |
| 1 | 0,3 | -0.189606 | -0.11100 | -0.134149 | 129 | 129 | 8 | 4 |
| 2 | 0,5 | -0.217163 | -0.12776 | -0.149943 | 129 | 129 | 8 | 4 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.5_results.csv
 - v1.2_vs_v1.5_comparison_table.csv
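The negative cross-validated values for v1.5 above are expected behavior, not an error: out-of-sample R² compares the model against the test fold's mean, so it goes below zero whenever the fit predicts worse than that mean. A minimal sketch of the ticker-aware scoring (GroupKFold keeps every row of a ticker inside a single fold, so the model is always tested on unseen tickers), on synthetic data:

```python
# Ticker-aware CV: GroupKFold never splits a ticker across train and test,
# so the score measures generalization to unseen tickers. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "ticker": np.repeat(["AAPL", "MSFT", "NVDA"], n // 3),
    "x": rng.normal(size=n),
})
df["y"] = 0.5 * df["x"] + rng.normal(scale=0.3, size=n)

scores = []
for tr, te in GroupKFold(n_splits=3).split(df[["x"]], df["y"], groups=df["ticker"]):
    # each test fold contains only tickers absent from the training fold
    assert set(df["ticker"].iloc[tr]).isdisjoint(df["ticker"].iloc[te])
    mdl = LinearRegression().fit(df[["x"]].iloc[tr], df["y"].iloc[tr])
    y_true = df["y"].iloc[te].to_numpy()
    y_pred = mdl.predict(df[["x"]].iloc[te])
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)  # can go negative out of sample
print(round(float(np.mean(scores)), 3))
```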
In [33]:
# === Find the best add-on features from v3 for a v1.2 baseline ===
# - Join on day0 + ticker
# - Baseline = v1.2 features only
# - Rank each v3 feature by one-at-a-time cross-validated coefficient of determination gain (delta)
# - Then do greedy forward add: keep adding v3 features while cross-validated coefficient of determination improves
# - Saves: top single gains, greedy path, summary
#
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# -------- CONFIG --------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data"),
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE = "features v1.2.xlsx" # your base 8
POOL_FILE = "features v3.xlsx" # extra candidates
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5 # grouped by ticker
MAX_ADDS = 8 # try adding up to this many features (change if you want)
MIN_GAIN = 0.01 # require at least this improvement in cross-validated coefficient of determination to keep a feature
# -------- HELPERS --------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists():
return p
raise FileNotFoundError(f"Could not find {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm):
continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)):
out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
cols = [str(c) for c in df.columns]
strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest:
best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns:
return c
# guess
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = (df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
.dropna(subset=["__day0__","__ticker__"]))
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_group_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_splits=5):
"""Mean test coefficient of determination. Use GroupKFold by ticker. Fallback to KFold if needed."""
n_groups = int(pd.Series(groups).nunique())
model = LinearRegression()
scores = []
if n_groups >= 2:
splits = GroupKFold(n_splits=min(max_splits, n_groups)).split(X, y, groups=groups)
else:
# row-wise fallback
n = len(X)
if n < 3:
return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
model.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = model.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_hat)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
if X.shape[1] == 0:
return np.nan, np.nan
mdl = LinearRegression().fit(X.values, y.values)
r2 = float(mdl.score(X.values, y.values))
n, p = len(y), X.shape[1]
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
return r2, adj
# -------- LOAD --------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
base_book = pd.read_excel(find_file(BASE_FILE), sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw = base_book[base_sheet].copy()
b_day0 = find_day0_column(base_raw); b_tic = find_ticker_column(base_raw)
base_grp, base_num_cols = aggregate_features(base_raw, b_day0, b_tic)
pool_book = pd.read_excel(find_file(POOL_FILE), sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw = pool_book[pool_sheet].copy()
p_day0 = find_day0_column(pool_raw); p_tic = find_ticker_column(pool_raw)
pool_grp, pool_num_cols = aggregate_features(pool_raw, p_day0, p_tic)
# -------- WORK PER WINDOW --------
all_single = []
all_paths = []
all_summary= []
for window in WINDOWS:
esheet = win_map.get(window)
if esheet is None:
print(f"Skip window {window}: event sheet not found.")
continue
df_evt = evt_book[esheet].copy()
e_day0 = find_day0_column(df_evt); e_tic = find_ticker_column(df_evt); y_col = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[e_day0])
evt["__ticker__"] = normalize_ticker(evt[e_tic])
evt = (evt.dropna(subset=["__day0__","__ticker__", y_col])
.drop_duplicates(subset=["__day0__","__ticker__"]))
# Merge base + event
merged_base = base_grp.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
X_base_all = build_X(merged_base, base_num_cols, y_col)
y = merged_base[y_col].astype(float)
groups = merged_base["__ticker__"]
# Remember which base columns actually survived cleaning
base_used = list(X_base_all.columns)
# Baseline scores
base_r2, base_adj = in_sample_and_adjusted(X_base_all, y)
base_cv = safe_group_cv_r2(X_base_all, y, groups, max_splits=MAX_GROUP_FOLDS)
# Build a single merged table with pool candidates aligned
pool_only_cols = [c for c in pool_num_cols if c not in base_used]
merged_pool = pool_grp[["__day0__","__ticker__"] + pool_only_cols]
merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")
# ---- ONE-AT-A-TIME RANKING ----
rows = []
for c in pool_only_cols:
if c not in merged.columns:
continue
Xt = pd.concat([X_base_all, merged[[c]]], axis=1)
data = pd.concat([y, Xt], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
if X_c.shape[1] == 0 or len(y_c) < 10:
continue
cv = safe_group_cv_r2(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_GROUP_FOLDS)
rows.append({"window": window, "feature": c,
"cv_with_feature": cv, "delta": cv - base_cv,
"rows_used": len(y_c)})
single_df = pd.DataFrame(rows).sort_values("delta", ascending=False).reset_index(drop=True)
all_single.append(single_df)
# ---- GREEDY FORWARD ADD (start from base, add v3 features while cross-validated coefficient of determination rises) ----
added = []
current_cols = base_used.copy()
current_cv = base_cv
path_rows = []
for step in range(MAX_ADDS):
best_gain, best_feat, best_cv = 0.0, None, None
for c in pool_only_cols:
if c in added:
continue
Xt = pd.concat([merged[current_cols], merged[[c]]], axis=1)
data = pd.concat([y, Xt], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
if X_c.shape[1] == 0 or len(y_c) < 10:
continue
cv = safe_group_cv_r2(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_GROUP_FOLDS)
gain = cv - current_cv
if gain > best_gain:
best_gain, best_feat, best_cv = gain, c, cv
if best_feat is None or best_gain < MIN_GAIN:
break
added.append(best_feat)
current_cols.append(best_feat)
current_cv = best_cv
X_now = merged[current_cols].dropna()
r2_now, adj_now = in_sample_and_adjusted(X_now, y.loc[X_now.index])
path_rows.append({"window": window, "step": len(added), "added_feature": best_feat,
"cv_r_squared": current_cv, "gain": best_gain,
"r_squared_in_sample": r2_now, "adjusted_r_squared_in_sample": adj_now})
path_df = pd.DataFrame(path_rows)
all_paths.append(path_df)
all_summary.append({
"window": window,
"base_used_features": ", ".join(base_used),
"baseline_cross_validated_r_squared": base_cv,
"baseline_r_squared": base_r2,
"baseline_adjusted_r_squared": base_adj,
"n_added_from_v3": len(added),
"added_features": ", ".join(added),
"final_cross_validated_r_squared": current_cv,
"improvement_vs_baseline": current_cv - base_cv
})
# -------- REPORT + SAVE --------
single_all = pd.concat(all_single, ignore_index=True) if all_single else pd.DataFrame()
path_all = pd.concat(all_paths, ignore_index=True) if all_paths else pd.DataFrame()
summary = pd.DataFrame(all_summary)
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)
print("\n=== Summary (per window) ===")
display(summary.sort_values("window"))
if not single_all.empty:
print("\n=== Top single add-on features from v3 (by delta cross-validated coefficient of determination) ===")
display(single_all.groupby("window").head(25))
if not path_all.empty:
print("\n=== Greedy forward add path (what we would add in order) ===")
display(path_all)
out_dir = find_file(EVENT_FILE).parent
summary.to_csv(out_dir / "v12_addfromv3_summary.csv", index=False)
if not single_all.empty: single_all.to_csv(out_dir / "v12_addfromv3_top_single_gains.csv", index=False)
if not path_all.empty: path_all.to_csv(out_dir / "v12_addfromv3_greedy_path.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_addfromv3_summary.csv")
print(" - v12_addfromv3_top_single_gains.csv")
print(" - v12_addfromv3_greedy_path.csv")
=== Summary (per window) ===
| window | base_used_features | baseline_cross_validated_r_squared | baseline_r_squared | baseline_adjusted_r_squared | n_added_from_v3 | added_features | final_cross_validated_r_squared | improvement_vs_baseline | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... | 0.068034 | 0.245160 | 0.194838 | 1 | macro_cpi_yoy | 0.081215 | 0.013181 |
| 1 | 0,3 | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... | 0.094267 | 0.201481 | 0.148246 | 0 | | 0.094267 | 0.000000 |
| 2 | 0,5 | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... | 0.121771 | 0.214735 | 0.162384 | 0 | | 0.121771 | 0.000000 |
=== Top single add-on features from v3 (by delta cross-validated coefficient of determination) ===
| window | feature | cv_with_feature | delta | rows_used | |
|---|---|---|---|---|---|
| 0 | 0,1 | macro_cpi_yoy | 0.081215 | 1.318111e-02 | 129 |
| 1 | 0,1 | pre_vol_10d | 0.072635 | 4.601765e-03 | 129 |
| 2 | 0,1 | rates_x_surprise | 0.069896 | 1.862384e-03 | 129 |
| 3 | 0,1 | is_amc | 0.068034 | -8.881784e-16 | 129 |
| 4 | 0,1 | is_bmo | 0.068034 | -8.881784e-16 | 129 |
| 5 | 0,1 | is_friday | 0.068034 | -8.881784e-16 | 129 |
| 6 | 0,1 | pre_ret_10d | 0.065582 | -2.451491e-03 | 129 |
| 7 | 0,1 | cpi_x_prevol5d | 0.063162 | -4.871555e-03 | 129 |
| 8 | 0,1 | is_monday | 0.060641 | -7.392425e-03 | 129 |
| 9 | 0,1 | vix_chg_10d_lag1 | 0.060394 | -7.639289e-03 | 129 |
| 10 | 0,1 | high_vix_regime | 0.057423 | -1.061060e-02 | 129 |
| 11 | 0,1 | cpi_x_surprise | 0.054169 | -1.386500e-02 | 129 |
| 12 | 0,1 | is_january | 0.051872 | -1.616123e-02 | 129 |
| 13 | 0,1 | investment_grade_option_adjusted_spread_bp | 0.050660 | -1.737398e-02 | 129 |
| 14 | 0,1 | investment_grade_option_adjusted_spread_pct | 0.050660 | -1.737398e-02 | 129 |
| 15 | 0,1 | macro_fedfunds | 0.050613 | -1.742084e-02 | 129 |
| 16 | 0,1 | high_rates_regime | 0.049974 | -1.805981e-02 | 129 |
| 17 | 0,1 | pre_vol_3d | 0.048157 | -1.987686e-02 | 129 |
| 18 | 0,1 | weekly_density | 0.048107 | -1.992688e-02 | 129 |
| 19 | 0,1 | high_density_week | 0.048107 | -1.992688e-02 | 129 |
| 20 | 0,1 | month | 0.046575 | -2.145911e-02 | 129 |
| 21 | 0,1 | baa_minus_aaa_bp | 0.044385 | -2.364825e-02 | 129 |
| 22 | 0,1 | baa_minus_aaa_pct | 0.044385 | -2.364825e-02 | 129 |
| 23 | 0,1 | mkt_ret_1d_lag1 | 0.041337 | -2.669715e-02 | 129 |
| 24 | 0,1 | quarter | 0.040292 | -2.774160e-02 | 129 |
| 36 | 0,3 | macro_cpi_yoy | 0.096439 | 2.172138e-03 | 129 |
| 37 | 0,3 | is_amc | 0.094267 | 6.383782e-16 | 129 |
| 38 | 0,3 | is_friday | 0.094267 | 6.383782e-16 | 129 |
| 39 | 0,3 | is_bmo | 0.094267 | 6.383782e-16 | 129 |
| 40 | 0,3 | pre_vol_10d | 0.093876 | -3.911947e-04 | 129 |
| 41 | 0,3 | is_monday | 0.093087 | -1.179961e-03 | 129 |
| 42 | 0,3 | macro_fedfunds | 0.090032 | -4.235049e-03 | 129 |
| 43 | 0,3 | rates_x_surprise | 0.089324 | -4.943540e-03 | 129 |
| 44 | 0,3 | cpi_x_prevol5d | 0.088516 | -5.751032e-03 | 129 |
| 45 | 0,3 | investment_grade_option_adjusted_spread_bp | 0.088365 | -5.902608e-03 | 129 |
| 46 | 0,3 | investment_grade_option_adjusted_spread_pct | 0.088365 | -5.902608e-03 | 129 |
| 47 | 0,3 | cpi_x_surprise | 0.080796 | -1.347118e-02 | 129 |
| 48 | 0,3 | pre_vol_3d | 0.080411 | -1.385620e-02 | 129 |
| 49 | 0,3 | high_rates_regime | 0.079143 | -1.512391e-02 | 129 |
| 50 | 0,3 | is_january | 0.078816 | -1.545143e-02 | 129 |
| 51 | 0,3 | high_vix_regime | 0.077928 | -1.633957e-02 | 129 |
| 52 | 0,3 | pre_ret_10d | 0.077158 | -1.710937e-02 | 129 |
| 53 | 0,3 | baa_minus_aaa_pct | 0.076349 | -1.791780e-02 | 129 |
| 54 | 0,3 | baa_minus_aaa_bp | 0.076349 | -1.791780e-02 | 129 |
| 55 | 0,3 | weekly_density | 0.074380 | -1.988693e-02 | 129 |
| 56 | 0,3 | high_density_week | 0.074380 | -1.988693e-02 | 129 |
| 57 | 0,3 | month | 0.068821 | -2.544574e-02 | 129 |
| 58 | 0,3 | mkt_ret_1d_lag1 | 0.067033 | -2.723425e-02 | 129 |
| 59 | 0,3 | quarter | 0.059743 | -3.452387e-02 | 129 |
| 60 | 0,3 | day_of_week | 0.054746 | -3.952102e-02 | 129 |
| 72 | 0,5 | is_amc | 0.121771 | 1.221245e-15 | 129 |
| 73 | 0,5 | is_bmo | 0.121771 | 1.221245e-15 | 129 |
| 74 | 0,5 | is_friday | 0.121771 | 1.221245e-15 | 129 |
| 75 | 0,5 | pre_vol_10d | 0.120651 | -1.120119e-03 | 129 |
| 76 | 0,5 | macro_fedfunds | 0.119332 | -2.438700e-03 | 129 |
| 77 | 0,5 | macro_cpi_yoy | 0.117592 | -4.178903e-03 | 129 |
| 78 | 0,5 | high_density_week | 0.116452 | -5.318717e-03 | 129 |
| 79 | 0,5 | weekly_density | 0.116452 | -5.318717e-03 | 129 |
| 80 | 0,5 | cpi_x_prevol5d | 0.115113 | -6.657469e-03 | 129 |
| 81 | 0,5 | is_monday | 0.114571 | -7.199965e-03 | 129 |
| 82 | 0,5 | investment_grade_option_adjusted_spread_bp | 0.113210 | -8.560419e-03 | 129 |
| 83 | 0,5 | investment_grade_option_adjusted_spread_pct | 0.113210 | -8.560419e-03 | 129 |
| 84 | 0,5 | high_vix_regime | 0.111899 | -9.871706e-03 | 129 |
| 85 | 0,5 | high_rates_regime | 0.111492 | -1.027892e-02 | 129 |
| 86 | 0,5 | baa_minus_aaa_bp | 0.110206 | -1.156436e-02 | 129 |
| 87 | 0,5 | baa_minus_aaa_pct | 0.110206 | -1.156436e-02 | 129 |
| 88 | 0,5 | is_january | 0.108703 | -1.306799e-02 | 129 |
| 89 | 0,5 | pre_ret_10d | 0.106843 | -1.492733e-02 | 129 |
| 90 | 0,5 | pre_vol_3d | 0.105928 | -1.584277e-02 | 129 |
| 91 | 0,5 | month | 0.101969 | -1.980198e-02 | 129 |
| 92 | 0,5 | rates_x_surprise | 0.097229 | -2.454192e-02 | 129 |
| 93 | 0,5 | mkt_ret_1d_lag1 | 0.097109 | -2.466205e-02 | 129 |
| 94 | 0,5 | cpi_x_surprise | 0.095834 | -2.593715e-02 | 129 |
| 95 | 0,5 | quarter | 0.092512 | -2.925895e-02 | 129 |
| 96 | 0,5 | vix_chg_10d_lag1 | 0.091427 | -3.034321e-02 | 129 |
=== Greedy forward add path (what we would add in order) ===
| window | step | added_feature | cv_r_squared | gain | r_squared_in_sample | adjusted_r_squared_in_sample | |
|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 1 | macro_cpi_yoy | 0.081215 | 0.013181 | 0.262488 | 0.20671 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_addfromv3_summary.csv
 - v12_addfromv3_top_single_gains.csv
 - v12_addfromv3_greedy_path.csv
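The greedy loop above starts from the v1.2 baseline and adds one v3 candidate at a time, keeping it only if cross-validated R² improves by at least MIN_GAIN (here it stopped after a single addition, macro_cpi_yoy on the 0,1 window). The same control flow on synthetic data, without the grouped CV and merge machinery; the feature names f1..f4 are illustrative, with f2 and f4 deliberately pure noise:

```python
# Sketch of the greedy forward-add loop: score every unused candidate,
# keep the single best one only if it clears MIN_GAIN, otherwise stop.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

MIN_GAIN = 0.01
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f1", "f2", "f3", "f4"])
y = 2.0 * X["f1"] - 1.0 * X["f3"] + rng.normal(scale=0.5, size=n)

def cv_r2(cols):
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(LinearRegression(), X[cols], y, cv=cv, scoring="r2").mean()

selected, current = [], 0.0
while True:
    gains = {c: cv_r2(selected + [c]) - current for c in X.columns if c not in selected}
    if not gains:
        break
    best = max(gains, key=gains.get)
    if gains[best] < MIN_GAIN:  # no candidate clears the bar: stop adding
        break
    selected.append(best)
    current += gains[best]
print(selected)  # informative features are added first; noise fails MIN_GAIN
```

The MIN_GAIN threshold is what keeps the loop honest: without it, tiny chance improvements from noise features would keep the loop adding predictors indefinitely.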
In [35]:
# === Compare features v1.2 vs v1.3 vs v1.4 (join on day0 + ticker) ===
# Metrics shown per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx", "features v1.4.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
# pick sheet with the most numeric columns (then most rows)
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback: best-looking object column
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
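`normalize_day0` parses each date column twice and prefers the day-first reading wherever it is valid, which guards against mixed UK/US-style dates breaking the day0 + ticker join. A minimal sketch of that behavior on hypothetical inputs:

```python
import pandas as pd

def normalize_day0(s: pd.Series) -> pd.Series:
    # Parse with pandas defaults (month-first for ambiguous strings) and again
    # day-first; keep the day-first result wherever it parsed successfully.
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

dates = pd.Series(["03/04/2021", "31/12/2020"])
print(normalize_day0(dates).tolist())
```

With these inputs the ambiguous "03/04/2021" resolves day-first to 3 April, and "31/12/2020" (which only parses day-first) is kept rather than coerced to NaT.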
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
"""Mean test coefficient of determination with GroupKFold by ticker; KFold fallback if too few tickers."""
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_hat)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
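The point of grouping the CV folds by ticker is that the same ticker never contributes rows to both the train and test sides of a split, so the score reflects generalization to unseen names. A small illustration with made-up groups:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
groups = np.array(["AAPL", "AAPL", "MSFT", "MSFT", "NVDA", "NVDA"])

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Each ticker is held out whole: train and test share no groups.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
print("each fold holds out complete tickers")
```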
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
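`fit_and_score` applies the standard adjusted R^2 correction, 1 - (1 - R^2)(n - 1)/(n - p - 1), which penalizes each extra predictor. Plugging in the v1.2 window (0,1) figures from the results output (n = 129 rows, p = 8 predictors, R^2 rounded to 0.245160) recovers the reported adjusted value up to display rounding:

```python
# Adjusted R^2 sketch using rounded figures from the v1.2 results table.
n, p, r2 = 129, 8, 0.245160
adjusted = 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)
print(round(adjusted, 6))  # close to the reported 0.194838 (inputs are rounded)
```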
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
present = [f for f in FEATURE_FILES if (BASE_DIRS[0]/f).exists() or (Path(".")/f).exists() or (Path("/mnt/data")/f).exists()]
assert present, "None of the features files were found."
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3 vs v1.4):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
for alt in ["features v1.3.xlsx", "features v1.4.xlsx"]:
comp = res_df[(res_df["features_file"]==alt) & (res_df["window"]==w)]
if not base.empty and not comp.empty:
pairs.append({
"window": w,
"model_vs_v1.2": alt,
"delta_cross_validated_r_squared": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
"delta_r_squared": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
"rows_used_base": int(base["rows_used"].iloc[0]),
"rows_used_alt": int(comp["rows_used"].iloc[0]),
"features_used_base": int(base["features_used"].iloc[0]),
"features_used_alt": int(comp["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs).sort_values(["window","model_vs_v1.2"]).reset_index(drop=True)
print("\nDeltas vs v1.2 — positive is good:")
display(deltas)
# ---------- SAVE ----------
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_v1.3_v1.4_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_v1.3_v1.4_comparison_table.csv")
if pairs:
deltas.to_csv(out_dir / "v1.2_v1.3_v1.4_deltas_vs_v12.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_v1.3_v1.4_results.csv")
print(" - v1.2_v1.3_v1.4_comparison_table.csv")
print(" - v1.2_v1.3_v1.4_deltas_vs_v12.csv")
Merge audit:
| features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.3.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
| 4 | features v1.3.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
| 5 | features v1.3.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
| 6 | features v1.4.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
| 7 | features v1.4.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
| 8 | features v1.4.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 9 | CAR |
Results (v1.2 vs v1.3 vs v1.4):
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window | |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 9 | 0.251629 | 0.195030 | 0.037240 | features v1.3.xlsx | features | 0,1 |
| 2 | 129 | 9 | 0.257667 | 0.201524 | 0.045910 | features v1.4.xlsx | features | 0,1 |
| 3 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 4 | 129 | 9 | 0.212109 | 0.152520 | 0.018953 | features v1.3.xlsx | features | 0,3 |
| 5 | 129 | 9 | 0.208601 | 0.148747 | 0.054705 | features v1.4.xlsx | features | 0,3 |
| 6 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 7 | 129 | 9 | 0.221031 | 0.162118 | 0.031952 | features v1.3.xlsx | features | 0,5 |
| 8 | 129 | 9 | 0.215733 | 0.156419 | 0.082984 | features v1.4.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| adjusted_r_squared | cross_validated_r_squared | r_squared | |||||||
|---|---|---|---|---|---|---|---|---|---|
| features_file | features v1.2.xlsx | features v1.3.xlsx | features v1.4.xlsx | features v1.2.xlsx | features v1.3.xlsx | features v1.4.xlsx | features v1.2.xlsx | features v1.3.xlsx | features v1.4.xlsx |
| window | |||||||||
| 0,1 | 0.194838 | 0.195030 | 0.201524 | 0.068034 | 0.037240 | 0.045910 | 0.245160 | 0.251629 | 0.257667 |
| 0,3 | 0.148246 | 0.152520 | 0.148747 | 0.094267 | 0.018953 | 0.054705 | 0.201481 | 0.212109 | 0.208601 |
| 0,5 | 0.162384 | 0.162118 | 0.156419 | 0.121771 | 0.031952 | 0.082984 | 0.214735 | 0.221031 | 0.215733 |
Deltas vs v1.2 — positive is good:
| window | model_vs_v1.2 | delta_cross_validated_r_squared | delta_adjusted_r_squared | delta_r_squared | rows_used_base | rows_used_alt | features_used_base | features_used_alt | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | features v1.3.xlsx | -0.030793 | 0.000192 | 0.006469 | 129 | 129 | 8 | 9 |
| 1 | 0,1 | features v1.4.xlsx | -0.022123 | 0.006686 | 0.012506 | 129 | 129 | 8 | 9 |
| 2 | 0,3 | features v1.3.xlsx | -0.075314 | 0.004274 | 0.010628 | 129 | 129 | 8 | 9 |
| 3 | 0,3 | features v1.4.xlsx | -0.039562 | 0.000500 | 0.007120 | 129 | 129 | 8 | 9 |
| 4 | 0,5 | features v1.3.xlsx | -0.089819 | -0.000266 | 0.006297 | 129 | 129 | 8 | 9 |
| 5 | 0,5 | features v1.4.xlsx | -0.038787 | -0.005965 | 0.000998 | 129 | 129 | 8 | 9 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_v1.3_v1.4_results.csv
 - v1.2_v1.3_v1.4_comparison_table.csv
 - v1.2_v1.3_v1.4_deltas_vs_v12.csv
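The wide comparison table above (rows = windows, columns = metrics per file) is produced with `pivot_table`. A toy reproduction with illustrative numbers, not the actual results:

```python
import pandas as pd

res = pd.DataFrame({
    "window": ["0,1", "0,1", "0,3", "0,3"],
    "features_file": ["features v1.2.xlsx", "features v1.3.xlsx"] * 2,
    "r_squared": [0.245, 0.252, 0.201, 0.212],  # illustrative values only
})
# aggfunc="first" is safe because each (window, file) pair occurs once
wide = res.pivot_table(index="window", columns="features_file",
                       values="r_squared", aggfunc="first")
print(wide)
```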
In [41]:
# === Audit: does "macro CPI YoY" from v3 lift cross validated coefficient of determination? ===
# Compares per window: v1.2 baseline, v1.2 + macro from v3, v1.3 file.
# Also checks whether v1.3's macro column equals the v3 macro values after join on day0 + ticker.
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ----------------- CONFIG -----------------
SEARCH_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE = "features v1.2.xlsx"
V13_FILE = "features v1.3.xlsx"
V3_FILE = "features v3.xlsx"
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_FOLDS = 5
# ----------------- HELPERS -----------------
def find_file(name: str) -> Path:
for d in SEARCH_DIRS:
p = d / name
if p.exists():
return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
# pick non-readme sheet with most numeric columns, then most rows
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands:
return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm):
continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)):
out[w] = nm
return out
def find_day0(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like column
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest:
best, kbest = c, k
return best
def find_ticker(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback: best object column by uniqueness
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score:
best, score = c, sc
return best
def find_target(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def norm_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def group_numeric(df: pd.DataFrame, day0_col: str, tic_col: str):
g = df.copy()
g["__day0__"] = norm_day0(g[day0_col])
g["__tic__"] = norm_ticker(g[tic_col])
num = g.select_dtypes(include=[np.number]).columns.tolist()
g = (g.groupby(["__day0__","__tic__"], as_index=False)[num].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, num
def build_X(merged: pd.DataFrame, cols: list, ycol: str):
keep = [c for c in cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[ycol], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
splitter = GroupKFold(n_splits=min(max_folds, n_groups))
splits = splitter.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splitter = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
splits = splitter.split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
pred = mdl.predict(X.iloc[te].values)
true = y.iloc[te].values
ss_res = np.sum((true - pred)**2)
ss_tot = np.sum((true - np.mean(true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def r2_and_adjusted(X: pd.DataFrame, y: pd.Series):
mdl = LinearRegression().fit(X.values, y.values)
r2 = float(mdl.score(X.values, y.values))
n, p = len(y), X.shape[1]
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
return r2, adj
def find_macro_cpi_yoy(cols: list) -> str | None:
# Flexible match for names like "macro_cpi_yoy", "Macro CPI YoY", etc.
for c in cols:
s = re.sub(r"[^a-z0-9]+", "", str(c).lower())
if "macro" in s and "cpi" in s and ("yoy" in s or "yearover" in s or "yoy" in s):
return c
# second pass: contains "cpi" and "yoy"
for c in cols:
s = str(c).lower()
if "cpi" in s and "yoy" in s:
return c
return None
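`find_macro_cpi_yoy` canonicalizes column names (lowercase, non-alphanumerics stripped) before matching, so spacing, case, and punctuation variants all resolve to the same key. An illustrative check on hypothetical column names:

```python
import re

def canon(name: str) -> str:
    # Same normalization as find_macro_cpi_yoy: lowercase, drop non-alphanumerics.
    return re.sub(r"[^a-z0-9]+", "", str(name).lower())

candidates = ["macro_cpi_yoy", "Macro CPI YoY", "MACRO-CPI-YOY (%)"]
hits = [c for c in candidates
        if all(k in canon(c) for k in ("macro", "cpi", "yoy"))]
print(hits)  # all three naming variants match
```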
# ----------------- LOAD FILES -----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
b_book = pd.read_excel(find_file(BASE_FILE), sheet_name=None, engine="openpyxl")
b_sheet = choose_features_sheet(b_book); b_raw = b_book[b_sheet].copy()
b_day0 = find_day0(b_raw); b_tic = find_ticker(b_raw)
b_grp, b_cols = group_numeric(b_raw, b_day0, b_tic)
v3_book = pd.read_excel(find_file(V3_FILE), sheet_name=None, engine="openpyxl")
v3_sheet = choose_features_sheet(v3_book); v3_raw = v3_book[v3_sheet].copy()
v3_day0 = find_day0(v3_raw); v3_tic = find_ticker(v3_raw)
v3_grp, v3_cols = group_numeric(v3_raw, v3_day0, v3_tic)
v13_book = pd.read_excel(find_file(V13_FILE), sheet_name=None, engine="openpyxl")
v13_sheet = choose_features_sheet(v13_book); v13_raw = v13_book[v13_sheet].copy()
v13_day0 = find_day0(v13_raw); v13_tic = find_ticker(v13_raw)
v13_grp, v13_cols = group_numeric(v13_raw, v13_day0, v13_tic)
macro_v3_col = find_macro_cpi_yoy(v3_cols)
macro_v13_col = find_macro_cpi_yoy(v13_cols)
if macro_v3_col is None:
raise ValueError("Could not find a 'macro CPI YoY' column in v3.")
# ----------------- RUN PER WINDOW -----------------
rows = []
checks = []
for w in WINDOWS:
es = win_map.get(w)
if es is None:
print(f"Skip {w}: no event sheet.")
continue
ev = evt_book[es].copy()
e_day0 = find_day0(ev); e_tic = find_ticker(ev); ycol = find_target(ev)
ev["__day0__"] = norm_day0(ev[e_day0])
ev["__tic__"] = norm_ticker(ev[e_tic])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
# --- v1.2 baseline ---
mb = b_grp.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X_base = build_X(mb, b_cols, ycol)
y = mb[ycol].astype(float)
groups = mb["__tic__"]
base_cv = cv_r2(X_base, y, groups, MAX_FOLDS)
base_r2, base_adj = r2_and_adjusted(X_base, y)
# --- v1.2 + macro column from v3 ---
macro_from_v3 = v3_grp[["__day0__","__tic__", macro_v3_col]].rename(columns={macro_v3_col:"macro_v3"})
mbm = mb.merge(macro_from_v3, on=["__day0__","__tic__"], how="left")
X_plus = pd.concat([X_base, mbm[["macro_v3"]]], axis=1)
data_plus = pd.concat([y, X_plus], axis=1).dropna()
y_plus, X_plus_c = data_plus.iloc[:,0], data_plus.iloc[:,1:]
plus_cv = cv_r2(X_plus_c, y_plus, groups.loc[X_plus_c.index], MAX_FOLDS)
# --- v1.3 file as-is ---
mv13 = v13_grp.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X_13 = build_X(mv13, v13_cols, ycol)
y_13 = mv13[ycol].astype(float)
g_13 = mv13["__tic__"]
v13_cv = cv_r2(X_13, y_13, g_13, MAX_FOLDS)
rows.append({
"window": w,
"rows_used": int(len(X_base)),
"features_used_v12": int(X_base.shape[1]),
"base_cross_validated_r_squared": float(base_cv),
"base_r_squared": float(base_r2),
"base_adjusted_r_squared": float(base_adj),
"v12_plus_macro_cross_validated_r_squared": float(plus_cv),
"delta_plus_vs_base": float(plus_cv - base_cv),
"v13_cross_validated_r_squared": float(v13_cv),
"features_used_v13": int(X_13.shape[1]),
})
# --- macro value equality check (v3 vs v1.3 on the same rows) ---
if macro_v13_col and macro_v13_col in mv13.columns:
macro_from_v13 = mv13[["__day0__","__tic__", macro_v13_col]].rename(columns={macro_v13_col:"macro_v13"})
join_macro = (
macro_from_v3
.merge(macro_from_v13, on=["__day0__","__tic__"], how="inner")
.dropna(subset=["macro_v3","macro_v13"])
)
if len(join_macro) > 0:
share_equal = (join_macro["macro_v3"].round(10) == join_macro["macro_v13"].round(10)).mean()
mean_abs_diff = (join_macro["macro_v3"] - join_macro["macro_v13"]).abs().mean()
corr = np.corrcoef(join_macro["macro_v3"], join_macro["macro_v13"])[0,1] if len(join_macro) > 2 else np.nan
else:
share_equal, mean_abs_diff, corr = np.nan, np.nan, np.nan
checks.append({
"window": w,
"macro_rows_overlap": int(len(join_macro)),
"share_exact_equal": float(share_equal) if pd.notna(share_equal) else np.nan,
"mean_abs_diff": float(mean_abs_diff) if pd.notna(mean_abs_diff) else np.nan,
"corr_v3_vs_v13": float(corr) if pd.notna(corr) else np.nan
})
else:
checks.append({
"window": w,
"macro_rows_overlap": 0,
"share_exact_equal": np.nan,
"mean_abs_diff": np.nan,
"corr_v3_vs_v13": np.nan
})
# ----------------- SHOW + SAVE -----------------
scores = pd.DataFrame(rows).sort_values("window").reset_index(drop=True)
macro_check = pd.DataFrame(checks).sort_values("window").reset_index(drop=True)
pd.set_option("display.max_columns", None)
print("\n=== Scores per window ===")
display(scores)
print("\n=== Macro CPI YoY equality check (v3 vs v1.3) ===")
display(macro_check)
# Save next to the event file
out_dir = find_file(EVENT_FILE).parent
scores.to_csv(out_dir / "audit_v12_vs_v12plusmacro_vs_v13_scores.csv", index=False)
macro_check.to_csv(out_dir / "audit_macro_value_check.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - audit_v12_vs_v12plusmacro_vs_v13_scores.csv")
print(" - audit_macro_value_check.csv")
=== Scores per window ===
| window | rows_used | features_used_v12 | base_cross_validated_r_squared | base_r_squared | base_adjusted_r_squared | v12_plus_macro_cross_validated_r_squared | delta_plus_vs_base | v13_cross_validated_r_squared | features_used_v13 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 129 | 8 | 0.068034 | 0.245160 | 0.194838 | 0.081215 | 0.013181 | 0.037240 | 9 |
| 1 | 0,3 | 129 | 8 | 0.094267 | 0.201481 | 0.148246 | 0.096439 | 0.002172 | 0.018953 | 9 |
| 2 | 0,5 | 129 | 8 | 0.121771 | 0.214735 | 0.162384 | 0.117592 | -0.004179 | 0.031952 | 9 |
=== Macro CPI YoY equality check (v3 vs v1.3) ===
| window | macro_rows_overlap | share_exact_equal | mean_abs_diff | corr_v3_vs_v13 | |
|---|---|---|---|---|---|
| 0 | 0,1 | 129 | 0.023256 | 2.485246 | -0.217281 |
| 1 | 0,3 | 129 | 0.023256 | 2.485246 | -0.217281 |
| 2 | 0,5 | 129 | 0.023256 | 2.485246 | -0.217281 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - audit_v12_vs_v12plusmacro_vs_v13_scores.csv
 - audit_macro_value_check.csv
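The `predictor_cols_after_clean` count audited in the next cell comes from `build_X`, which silently drops constant columns via `nunique` before fitting. A minimal sketch with made-up predictors:

```python
import pandas as pd

X = pd.DataFrame({
    "eps_surprise_pct": [1.2, -0.4, 3.1],
    "all_same": [7.0, 7.0, 7.0],      # constant column: carries no signal
    "pre_ret_3d": [0.01, -0.02, 0.00],
})
nunq = X.nunique(dropna=False)
X_clean = X.loc[:, nunq > 1]          # keep only columns that vary
print(list(X_clean.columns))
```

Dropping zero-variance columns avoids degenerate coefficients; a constant predictor is collinear with the intercept.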
In [45]:
# === Evaluate features v1.2 only (windows 0,1 / 0,3 / 0,5) ===
# Metrics: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURES_12 = "features v1.2.xlsx"
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
n, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = mdl.predict(X.iloc[te].values)
y_true = y.iloc[te].values
ss_res = np.sum((y_true - y_hat)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
f12_path = find_file(FEATURES_12)
f12_book = pd.read_excel(f12_path, sheet_name=None, engine="openpyxl")
f12_sheet = choose_features_sheet(f12_book)
f12_raw = f12_book[f12_sheet].copy()
f12_day0 = find_day0_column(f12_raw)
f12_tic = find_ticker_column(f12_raw)
f12_grp, f12_num_cols = aggregate_features(f12_raw, f12_day0, f12_tic)
# ---------- RUN ----------
rows = []
merge_audit = []
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Skip window {w}: event sheet not found.")
continue
ev = evt_book[esheet].copy()
e_day0 = find_day0_column(ev); e_tic = find_ticker_column(ev); ycol = find_target_col(ev)
ev["__day0__"] = normalize_day0(ev[e_day0])
ev["__ticker__"] = normalize_ticker(ev[e_tic])
ev = ev.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = f12_grp.merge(ev[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, f12_num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"window": w,
"features_sheet": f12_sheet,
"event_sheet": esheet,
"day0_features_col": f12_day0,
"ticker_features_col": f12_tic,
"day0_event_col": e_day0,
"ticker_event_col": e_tic,
"merged_rows": len(merged),
"predictor_cols_after_clean": X.shape[1],
"target_col": ycol,
"features_used": ", ".join(list(X.columns))
})
m = fit_and_score(X, y, groups)
m.update(dict(window=w))
rows.append(m)
# ---------- SHOW ----------
audit_df = pd.DataFrame(merge_audit)
res_df = pd.DataFrame(rows).sort_values("window").reset_index(drop=True)
pd.set_option("display.max_columns", None)
print("\nMerge audit for v1.2:")
display(audit_df)
print("\nResults for v1.2 only:")
display(res_df)
# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v12_results_only.csv", index=False)
audit_df.to_csv(out_dir / "v12_merge_audit.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_results_only.csv")
print(" - v12_merge_audit.csv")
Merge audit for v1.2:
| window | features_sheet | event_sheet | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols_after_clean | target_col | features_used | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | features | CAR_(0,1) | day0 | ticker | day0 | ticker | 129 | 8 | CAR | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... |
| 1 | 0,3 | features | CAR_(0,3) | day0 | ticker | day0 | ticker | 129 | 8 | CAR | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... |
| 2 | 0,5 | features | CAR_(0,5) | day0 | ticker | day0 | ticker | 129 | 8 | CAR | eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... |
Results for v1.2 only:
| rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | window | |
|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | 0,1 |
| 1 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | 0,3 |
| 2 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | 0,5 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_results_only.csv
 - v12_merge_audit.csv
In [48]:
# === Final check: features v1.xlsx vs features v1.2.xlsx (join on day0 + ticker) ===
# Metrics per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# Saves: results, wide comparison, and deltas vs v1.2
#
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.xlsx", "features v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
# choose non-readme sheet with most numeric columns (then most rows)
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
# fallback: best object col by uniqueness
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series):
    # Parse twice (month-first, then day-first) and prefer the day-first
    # result whenever it parses; this assumes day-first is the intended
    # convention for ambiguous dates such as 01/02/2024.
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = mdl.predict(X.iloc[te].values)
yt = y.iloc[te].values
ss_res = np.sum((yt - y_hat)**2); ss_tot = np.sum((yt - np.mean(yt))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
present = [f for f in FEATURE_FILES if any((b/f).exists() for b in BASE_DIRS)]
assert present, "Could not find features v1.xlsx or features v1.2.xlsx"
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.2):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
comp = res_df[(res_df["features_file"]=="features v1.xlsx") & (res_df["window"]==w)]
if not base.empty and not comp.empty:
pairs.append({
"window": w,
"delta_cv_r_squared_v1.2_minus_v1": float(base["cross_validated_r_squared"].iloc[0] - comp["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared_v1.2_minus_v1": float(base["adjusted_r_squared"].iloc[0] - comp["adjusted_r_squared"].iloc[0]),
"delta_r_squared_v1.2_minus_v1": float(base["r_squared"].iloc[0] - comp["r_squared"].iloc[0]),
"rows_used_v1.2": int(base["rows_used"].iloc[0]),
"rows_used_v1": int(comp["rows_used"].iloc[0]),
"features_used_v1.2": int(base["features_used"].iloc[0]),
"features_used_v1": int(comp["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs).sort_values("window").reset_index(drop=True)
print("\nDeltas (v1.2 minus v1) — positive means v1.2 is better:")
display(deltas)
# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_vs_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_vs_v1.2_comparison_table.csv")
if pairs:
deltas.to_csv(out_dir / "v1_vs_v1.2_deltas.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_vs_v1.2_results.csv")
print(" - v1_vs_v1.2_comparison_table.csv")
print(" - v1_vs_v1.2_deltas.csv")
Merge audit:
| | features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 1 | features v1.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 2 | features v1.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 16 | CAR |
| 3 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 4 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 5 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
Results (v1 vs v1.2):
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 16 | 0.303485 | 0.203983 | -0.115372 | features v1.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 16 | 0.250824 | 0.143799 | -0.155072 | features v1.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 16 | 0.257400 | 0.151314 | -0.089552 | features v1.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (features v1.2.xlsx) | adjusted_r_squared (features v1.xlsx) | cross_validated_r_squared (features v1.2.xlsx) | cross_validated_r_squared (features v1.xlsx) | r_squared (features v1.2.xlsx) | r_squared (features v1.xlsx) |
|---|---|---|---|---|---|---|
| 0,1 | 0.194838 | 0.203983 | 0.068034 | -0.115372 | 0.245160 | 0.303485 |
| 0,3 | 0.148246 | 0.143799 | 0.094267 | -0.155072 | 0.201481 | 0.250824 |
| 0,5 | 0.162384 | 0.151314 | 0.121771 | -0.089552 | 0.214735 | 0.257400 |
Deltas (v1.2 minus v1) — positive means v1.2 is better:
| | window | delta_cv_r_squared_v1.2_minus_v1 | delta_adjusted_r_squared_v1.2_minus_v1 | delta_r_squared_v1.2_minus_v1 | rows_used_v1.2 | rows_used_v1 | features_used_v1.2 | features_used_v1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 0.183406 | -0.009145 | -0.058325 | 129 | 129 | 8 | 16 |
| 1 | 0,3 | 0.249339 | 0.004448 | -0.049343 | 129 | 129 | 8 | 16 |
| 2 | 0,5 | 0.211322 | 0.011070 | -0.042665 | 129 | 129 | 8 | 16 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1_vs_v1.2_results.csv
 - v1_vs_v1.2_comparison_table.csv
 - v1_vs_v1.2_deltas.csv
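The gap between the in-sample R² and the (often negative) cross-validated R² for v1 is easier to see with the CV scheme isolated. Below is a minimal, self-contained sketch of the ticker-grouped scoring loop from `safe_grouped_cv_r2`, run on synthetic data; the tickers, the 0.5 coefficient, and the noise scale are invented for illustration:

```python
# Minimal sketch of the ticker-grouped CV scoring used in safe_grouped_cv_r2.
# All data here are synthetic; tickers and coefficients are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n = 60
X = pd.DataFrame({"eps_surprise_pct": rng.normal(size=n)})
y = pd.Series(0.5 * X["eps_surprise_pct"] + rng.normal(scale=0.5, size=n))
groups = pd.Series(np.repeat(["AAPL", "MSFT", "NVDA", "AMZN", "GOOG"], n // 5))

scores = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=groups):
    # fit on four tickers, score out-of-sample on the held-out ticker
    mdl = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
    yt, y_hat = y.iloc[te].values, mdl.predict(X.iloc[te].values)
    ss_res = np.sum((yt - y_hat) ** 2)
    ss_tot = np.sum((yt - yt.mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)
cv_r2 = float(np.mean(scores))
```

Because every fold holds out entire tickers, no ticker appears in both train and test splits, which is what makes the CV estimate fall so sharply for the over-parameterized v1 feature set.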
In [1]:
# === Compare features v1.2 vs v1.3 (join on day0 + ticker) ===
# Metrics per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# Saves: results, wide comparison, and deltas vs v1.2
#
# If needed first: pip install pandas numpy scikit-learn openpyxl
from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {"0,1": None, "0,3": None, "0,5": None}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
return c2[0] if c2 else None
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
y_hat = mdl.predict(X.iloc[te].values)
yt = y.iloc[te].values
ss_res = np.sum((yt - y_hat)**2); ss_tot = np.sum((yt - np.mean(yt))**2)
scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
present = [f for f in FEATURE_FILES if any((b/f).exists() for b in BASE_DIRS)]
assert present, "Could not find features v1.2.xlsx or features v1.3.xlsx"
merge_audit = []
results = []
for fname in present:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
df_feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(df_feat_raw)
tfeat = find_ticker_column(df_feat_raw)
feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Missing event sheet for window {w}. Skipping.")
continue
df_evt = evt_book[esheet].copy()
devt = find_day0_column(df_evt)
tevt = find_ticker_column(df_evt)
ycol = find_target_col(df_evt)
evt = df_evt.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
groups = merged["__ticker__"]
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
merge_audit.append({
"features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
"day0_features_col": dfeat, "ticker_features_col": tfeat,
"day0_event_col": devt, "ticker_event_col": tevt,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# ---------- DELTAS vs baseline v1.2 (positive = v1.3 is better) ----------
pairs = []
for w in WINDOWS:
base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
comp = res_df[(res_df["features_file"]=="features v1.3.xlsx") & (res_df["window"]==w)]
if not base.empty and not comp.empty:
pairs.append({
"window": w,
"delta_cv_r_squared_v13_minus_v12": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
"delta_adjusted_r_squared_v13_minus_v12": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
"delta_r_squared_v13_minus_v12": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
"rows_used_v1.2": int(base["rows_used"].iloc[0]),
"rows_used_v1.3": int(comp["rows_used"].iloc[0]),
"features_used_v1.2": int(base["features_used"].iloc[0]),
"features_used_v1.3": int(comp["features_used"].iloc[0]),
})
if pairs:
deltas = pd.DataFrame(pairs).sort_values("window").reset_index(drop=True)
print("\nDeltas (v1.3 minus v1.2) — positive means v1.3 is better:")
display(deltas)
# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1.2_vs_v1.3_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.3_comparison_table.csv")
if pairs:
deltas.to_csv(out_dir / "v1.2_vs_v1.3_deltas.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.3_results.csv")
print(" - v1.2_vs_v1.3_comparison_table.csv")
print(" - v1.2_vs_v1.3_deltas.csv")
Merge audit:
| | features_file | features_sheet | event_sheet | window | day0_features_col | ticker_features_col | day0_event_col | ticker_event_col | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | features v1.2.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 1 | features v1.2.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 2 | features v1.2.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 8 | CAR |
| 3 | features v1.3.xlsx | features | CAR_(0,1) | 0,1 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
| 4 | features v1.3.xlsx | features | CAR_(0,3) | 0,3 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
| 5 | features v1.3.xlsx | features | CAR_(0,5) | 0,5 | day0 | ticker | day0 | ticker | 129 | 7 | CAR |
Results (v1.2 vs v1.3):
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | features v1.2.xlsx | features | 0,1 |
| 1 | 129 | 7 | 0.245005 | 0.201328 | 0.072892 | features v1.3.xlsx | features | 0,1 |
| 2 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | features v1.2.xlsx | features | 0,3 |
| 3 | 129 | 7 | 0.201430 | 0.155231 | 0.099199 | features v1.3.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | features v1.2.xlsx | features | 0,5 |
| 5 | 129 | 7 | 0.214615 | 0.169179 | 0.133885 | features v1.3.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (features v1.2.xlsx) | adjusted_r_squared (features v1.3.xlsx) | cross_validated_r_squared (features v1.2.xlsx) | cross_validated_r_squared (features v1.3.xlsx) | r_squared (features v1.2.xlsx) | r_squared (features v1.3.xlsx) |
|---|---|---|---|---|---|---|
| 0,1 | 0.194838 | 0.201328 | 0.068034 | 0.072892 | 0.245160 | 0.245005 |
| 0,3 | 0.148246 | 0.155231 | 0.094267 | 0.099199 | 0.201481 | 0.201430 |
| 0,5 | 0.162384 | 0.169179 | 0.121771 | 0.133885 | 0.214735 | 0.214615 |
Deltas (v1.3 minus v1.2) — positive means v1.3 is better:
| | window | delta_cv_r_squared_v13_minus_v12 | delta_adjusted_r_squared_v13_minus_v12 | delta_r_squared_v13_minus_v12 | rows_used_v1.2 | rows_used_v1.3 | features_used_v1.2 | features_used_v1.3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0,1 | 0.004858 | 0.006490 | -0.000155 | 129 | 129 | 8 | 7 |
| 1 | 0,3 | 0.004932 | 0.006985 | -0.000051 | 129 | 129 | 8 | 7 |
| 2 | 0,5 | 0.012114 | 0.006795 | -0.000120 | 129 | 129 | 8 | 7 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.3_results.csv
 - v1.2_vs_v1.3_comparison_table.csv
 - v1.2_vs_v1.3_deltas.csv
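All of these comparisons hinge on the `day0` + `ticker` join keys lining up across files, which is what the `normalize_day0` / `normalize_ticker` helpers and the inner merge handle. A toy illustration of that step (the tickers, dates, and CAR values below are synthetic; days greater than 12 are used so the month-first and day-first parses agree):

```python
# Toy illustration of the day0 + ticker key normalization and inner join.
# Synthetic rows only; days > 12 keep both date parses unambiguous.
import pandas as pd

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

feats = pd.DataFrame({"day0": ["2024-01-31", "2024-02-15"],
                      "ticker": [" aapl ", "msft"],
                      "eps_surprise_pct": [73.3, 37.1]})
evts = pd.DataFrame({"day0": ["31/01/2024", "15/02/2024"],
                     "ticker": ["AAPL", "MSFT"],
                     "CAR": [0.01, -0.02]})
for df in (feats, evts):
    df["__day0__"] = normalize_day0(df["day0"])
    df["__ticker__"] = normalize_ticker(df["ticker"])

# despite different date formats and ticker casing, both rows match
merged = feats.merge(evts[["__day0__", "__ticker__", "CAR"]],
                     on=["__day0__", "__ticker__"], how="inner")
```

The same pattern explains the constant `merged_rows = 129` across every audit row: the key normalization, not the feature set, determines how many events survive the join.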
In [5]:
# === Compare Baseline v1.xlsx vs v1.1.xlsx vs v1.2.xlsx on event_study_2.xlsx ===
# Windows: 0,1 0,3 0,5 0,10 0,15 0,20
# Join on day0 + ticker; grouped CV by ticker; saves CSVs next to the event file.
from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG (updated base folder) ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study_2.xlsx"
FEATURE_FILES = ["Baseline v1.xlsx", "v1.1.xlsx", "v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5","0,10","0,15","0,20"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {w: None for w in WINDOWS}
pats = {w: re.compile(rf"(car.*)?0\D*{w.split(',')[1]}(?!\d)", re.IGNORECASE) for w in WINDOWS}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
yh = mdl.predict(X.iloc[te].values)
yt = y.iloc[te].values
ss_res = ((yt - yh)**2).sum(); ss_tot = ((yt - yt.mean())**2).sum()
scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD EVENT STUDY ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
merge_audit, results = [], []
for fname in FEATURE_FILES:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(feat_raw)
tfeat = find_ticker_column(feat_raw)
feat_g, num_cols = aggregate_features(feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Skip window {w}: no matching sheet in {EVENT_FILE}.")
continue
evt_raw = evt_book[esheet].copy()
devt = find_day0_column(evt_raw)
tevt = find_ticker_column(evt_raw)
ycol = find_target_col(evt_raw)
evt = evt_raw.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__ticker__"]
merge_audit.append({
"features_file": fname, "features_sheet": fsheet,
"event_sheet": esheet, "window": w,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2):")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# Per-window winner by cross-validated R^2
winners = (res_df.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
.groupby("window").first().reset_index())
winners = winners[["window","features_file","cross_validated_r_squared","adjusted_r_squared","r_squared","rows_used","features_used"]]
print("\nBest per window (by cross-validated R^2):")
display(winners)
# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_table.csv")
winners.to_csv(out_dir / "v1_v1.1_v1.2_best_per_window.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results.csv")
print(" - v1_v1.1_v1.2_comparison_table.csv")
print(" - v1_v1.1_v1.2_best_per_window.csv")
Merge audit:
| | features_file | features_sheet | event_sheet | window | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|
| 0 | Baseline v1.xlsx | features | CAR_(0,1) | 0,1 | 129 | 16 | CAR |
| 1 | Baseline v1.xlsx | features | CAR_(0,3) | 0,3 | 129 | 16 | CAR |
| 2 | Baseline v1.xlsx | features | CAR_(0,5) | 0,5 | 129 | 16 | CAR |
| 3 | Baseline v1.xlsx | features | CAR_(0,10) | 0,10 | 129 | 16 | CAR |
| 4 | Baseline v1.xlsx | features | CAR_(0,15) | 0,15 | 129 | 16 | CAR |
| 5 | Baseline v1.xlsx | features | CAR_(0,20) | 0,20 | 129 | 16 | CAR |
| 6 | v1.1.xlsx | features | CAR_(0,1) | 0,1 | 129 | 8 | CAR |
| 7 | v1.1.xlsx | features | CAR_(0,3) | 0,3 | 129 | 8 | CAR |
| 8 | v1.1.xlsx | features | CAR_(0,5) | 0,5 | 129 | 8 | CAR |
| 9 | v1.1.xlsx | features | CAR_(0,10) | 0,10 | 129 | 8 | CAR |
| 10 | v1.1.xlsx | features | CAR_(0,15) | 0,15 | 129 | 8 | CAR |
| 11 | v1.1.xlsx | features | CAR_(0,20) | 0,20 | 129 | 8 | CAR |
| 12 | v1.2.xlsx | features | CAR_(0,1) | 0,1 | 129 | 7 | CAR |
| 13 | v1.2.xlsx | features | CAR_(0,3) | 0,3 | 129 | 7 | CAR |
| 14 | v1.2.xlsx | features | CAR_(0,5) | 0,5 | 129 | 7 | CAR |
| 15 | v1.2.xlsx | features | CAR_(0,10) | 0,10 | 129 | 7 | CAR |
| 16 | v1.2.xlsx | features | CAR_(0,15) | 0,15 | 129 | 7 | CAR |
| 17 | v1.2.xlsx | features | CAR_(0,20) | 0,20 | 129 | 7 | CAR |
Results (v1 vs v1.1 vs v1.2):
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 16 | 0.303485 | 0.203983 | -0.115372 | Baseline v1.xlsx | features | 0,1 |
| 1 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | v1.1.xlsx | features | 0,1 |
| 2 | 129 | 7 | 0.245005 | 0.201328 | 0.072892 | v1.2.xlsx | features | 0,1 |
| 3 | 129 | 16 | 0.196103 | 0.081260 | -0.852816 | Baseline v1.xlsx | features | 0,10 |
| 4 | 129 | 8 | 0.145079 | 0.088084 | -0.523390 | v1.1.xlsx | features | 0,10 |
| 5 | 129 | 7 | 0.144797 | 0.095323 | -0.509257 | v1.2.xlsx | features | 0,10 |
| 6 | 129 | 16 | 0.190205 | 0.074520 | -1.004755 | Baseline v1.xlsx | features | 0,15 |
| 7 | 129 | 8 | 0.096643 | 0.036420 | -0.489696 | v1.1.xlsx | features | 0,15 |
| 8 | 129 | 7 | 0.093859 | 0.041437 | -0.505546 | v1.2.xlsx | features | 0,15 |
| 9 | 129 | 16 | 0.390604 | 0.303547 | -0.209408 | Baseline v1.xlsx | features | 0,20 |
| 10 | 129 | 8 | 0.201758 | 0.148542 | -0.336828 | v1.1.xlsx | features | 0,20 |
| 11 | 129 | 7 | 0.198429 | 0.152057 | -0.318275 | v1.2.xlsx | features | 0,20 |
| 12 | 129 | 16 | 0.250824 | 0.143799 | -0.155072 | Baseline v1.xlsx | features | 0,3 |
| 13 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | v1.1.xlsx | features | 0,3 |
| 14 | 129 | 7 | 0.201430 | 0.155231 | 0.099199 | v1.2.xlsx | features | 0,3 |
| 15 | 129 | 16 | 0.257400 | 0.151314 | -0.089552 | Baseline v1.xlsx | features | 0,5 |
| 16 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | v1.1.xlsx | features | 0,5 |
| 17 | 129 | 7 | 0.214615 | 0.169179 | 0.133885 | v1.2.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (Baseline v1.xlsx) | adjusted_r_squared (v1.1.xlsx) | adjusted_r_squared (v1.2.xlsx) | cross_validated_r_squared (Baseline v1.xlsx) | cross_validated_r_squared (v1.1.xlsx) | cross_validated_r_squared (v1.2.xlsx) | r_squared (Baseline v1.xlsx) | r_squared (v1.1.xlsx) | r_squared (v1.2.xlsx) |
|---|---|---|---|---|---|---|---|---|---|
| 0,1 | 0.203983 | 0.194838 | 0.201328 | -0.115372 | 0.068034 | 0.072892 | 0.303485 | 0.245160 | 0.245005 |
| 0,10 | 0.081260 | 0.088084 | 0.095323 | -0.852816 | -0.523390 | -0.509257 | 0.196103 | 0.145079 | 0.144797 |
| 0,15 | 0.074520 | 0.036420 | 0.041437 | -1.004755 | -0.489696 | -0.505546 | 0.190205 | 0.096643 | 0.093859 |
| 0,20 | 0.303547 | 0.148542 | 0.152057 | -0.209408 | -0.336828 | -0.318275 | 0.390604 | 0.201758 | 0.198429 |
| 0,3 | 0.143799 | 0.148246 | 0.155231 | -0.155072 | 0.094267 | 0.099199 | 0.250824 | 0.201481 | 0.201430 |
| 0,5 | 0.151314 | 0.162384 | 0.169179 | -0.089552 | 0.121771 | 0.133885 | 0.257400 | 0.214735 | 0.214615 |
Best per window (by cross-validated R^2):
| | window | features_file | cross_validated_r_squared | adjusted_r_squared | r_squared | rows_used | features_used |
|---|---|---|---|---|---|---|---|
| 0 | 0,1 | v1.2.xlsx | 0.072892 | 0.201328 | 0.245005 | 129 | 7 |
| 1 | 0,10 | v1.2.xlsx | -0.509257 | 0.095323 | 0.144797 | 129 | 7 |
| 2 | 0,15 | v1.1.xlsx | -0.489696 | 0.036420 | 0.096643 | 129 | 8 |
| 3 | 0,20 | Baseline v1.xlsx | -0.209408 | 0.303547 | 0.390604 | 129 | 16 |
| 4 | 0,3 | v1.2.xlsx | 0.099199 | 0.155231 | 0.201430 | 129 | 7 |
| 5 | 0,5 | v1.2.xlsx | 0.133885 | 0.169179 | 0.214615 | 129 | 7 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
 - v1_v1.1_v1.2_results.csv
 - v1_v1.1_v1.2_comparison_table.csv
 - v1_v1.1_v1.2_best_per_window.csv
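The cross-validation above groups rows by ticker so that the same ticker never appears in both the train and test slices of a fold. A minimal sketch (toy data, made-up tickers) illustrating the guarantee `GroupKFold` provides:

```python
# Toy demonstration of the grouped-CV guarantee used in safe_grouped_cv_r2:
# with groups = ticker, each held-out fold contains whole tickers only,
# so the CV score cannot be inflated by memorising ticker-specific behaviour.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)                    # 6 toy rows, 2 features
y = np.arange(6, dtype=float)
groups = np.array(["AAPL", "AAPL", "MSFT", "MSFT", "NVDA", "NVDA"])

held_out = []
for tr, te in GroupKFold(n_splits=3).split(X, y, groups=groups):
    assert set(groups[tr]).isdisjoint(groups[te])  # no ticker in both slices
    held_out.append(set(groups[te]))
print(held_out)
```

With 129 merged rows but far fewer unique tickers, this is a stricter test than plain `KFold`, which is why the cross-validated figures sit well below the in-sample ones.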
In [7]:
# === Compare Baseline v1.xlsx vs v1.1.xlsx vs v1.2.xlsx on event_study.xlsx ===
# Windows: 0,1 0,3 0,5
# Join on day0 + ticker; grouped CV by ticker; saves CSVs next to the event file.
from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ---------- CONFIG ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["Baseline v1.xlsx", "v1.1.xlsx", "v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
# ---------- HELPERS ----------
def find_file(name: str):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme_sheet(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def choose_features_sheet(book: dict) -> str:
cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_event_window_sheets(book: dict):
out = {w: None for w in WINDOWS}
pats = {
"0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
"0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
"0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
}
for nm in book:
if is_readme_sheet(nm): continue
for w, pat in pats.items():
if out[w] is None and pat.search(str(nm)): out[w] = nm
return out
def find_day0_column(df: pd.DataFrame):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker_column(df: pd.DataFrame):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def find_target_col(df: pd.DataFrame):
c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
if c1: return c1[0]
c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
return c2[0] if c2 else None
def normalize_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def normalize_ticker(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
df = df_feat_raw.copy()
df["__day0__"] = normalize_day0(df[day0_col])
df["__ticker__"] = normalize_ticker(df[ticker_col])
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
g = g.dropna(subset=["__day0__","__ticker__"])
return g, num_cols
def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].copy()
X = X.drop(columns=[target_col], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
mdl = LinearRegression()
scores = []
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
n = len(X)
if n < 3: return np.nan
splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
for tr, te in splits:
mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
yh = mdl.predict(X.iloc[te].values)
yt = y.iloc[te].values
ss_res = ((yt - yh)**2).sum(); ss_tot = ((yt - yt.mean())**2).sum()
scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
return float(np.nanmean(scores))
def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
data = pd.concat([y, X], axis=1).dropna()
y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
n, p = len(y_c), X_c.shape[1]
if p == 0 or n < max(10, p+2):
return dict(rows_used=int(n), features_used=int(p),
r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
mdl = LinearRegression().fit(X_c.values, y_c.values)
r2 = float(mdl.score(X_c.values, y_c.values))
adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
return dict(rows_used=int(n), features_used=int(p),
r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)
# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)
# ---------- RUN ----------
merge_audit, results = [], []
for fname in FEATURE_FILES:
fpath = find_file(fname)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
feat_raw = feat_book[fsheet].copy()
dfeat = find_day0_column(feat_raw)
tfeat = find_ticker_column(feat_raw)
feat_g, num_cols = aggregate_features(feat_raw, dfeat, tfeat)
for w in WINDOWS:
esheet = win_map.get(w)
if esheet is None:
print(f"Skip window {w}: no matching sheet in {EVENT_FILE}.")
continue
evt_raw = evt_book[esheet].copy()
devt = find_day0_column(evt_raw)
tevt = find_ticker_column(evt_raw)
ycol = find_target_col(evt_raw)
evt = evt_raw.copy()
evt["__day0__"] = normalize_day0(evt[devt])
evt["__ticker__"] = normalize_ticker(evt[tevt])
evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])
merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
X = build_X(merged, num_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__ticker__"]
merge_audit.append({
"features_file": fname, "features_sheet": fsheet,
"event_sheet": esheet, "window": w,
"merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
})
m = fit_and_score(X, y, groups)
m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
results.append(m)
# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)
print("\nMerge audit:")
display(pd.DataFrame(merge_audit))
res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2) — windows 0,1 / 0,3 / 0,5:")
display(res_df)
print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
columns="features_file",
values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
aggfunc="first")
display(wide)
# Per-window winner by cross-validated R^2
winners = (res_df.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
.groupby("window").first().reset_index())
winners = winners[["window","features_file","cross_validated_r_squared","adjusted_r_squared","r_squared","rows_used","features_used"]]
print("\nBest per window (by cross-validated R^2):")
display(winners)
# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv")
winners.to_csv(out_dir / "v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv")
print(" - v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv")
print(" - v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv")
Merge audit:
| | features_file | features_sheet | event_sheet | window | merged_rows | predictor_cols | target_col |
|---|---|---|---|---|---|---|---|
| 0 | Baseline v1.xlsx | features | CAR_(0,1) | 0,1 | 129 | 16 | CAR |
| 1 | Baseline v1.xlsx | features | CAR_(0,3) | 0,3 | 129 | 16 | CAR |
| 2 | Baseline v1.xlsx | features | CAR_(0,5) | 0,5 | 129 | 16 | CAR |
| 3 | v1.1.xlsx | features | CAR_(0,1) | 0,1 | 129 | 8 | CAR |
| 4 | v1.1.xlsx | features | CAR_(0,3) | 0,3 | 129 | 8 | CAR |
| 5 | v1.1.xlsx | features | CAR_(0,5) | 0,5 | 129 | 8 | CAR |
| 6 | v1.2.xlsx | features | CAR_(0,1) | 0,1 | 129 | 7 | CAR |
| 7 | v1.2.xlsx | features | CAR_(0,3) | 0,3 | 129 | 7 | CAR |
| 8 | v1.2.xlsx | features | CAR_(0,5) | 0,5 | 129 | 7 | CAR |
Results (v1 vs v1.1 vs v1.2) — windows 0,1 / 0,3 / 0,5:
| | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared | features_file | features_sheet | window |
|---|---|---|---|---|---|---|---|---|
| 0 | 129 | 16 | 0.303485 | 0.203983 | -0.115372 | Baseline v1.xlsx | features | 0,1 |
| 1 | 129 | 8 | 0.245160 | 0.194838 | 0.068034 | v1.1.xlsx | features | 0,1 |
| 2 | 129 | 7 | 0.245005 | 0.201328 | 0.072892 | v1.2.xlsx | features | 0,1 |
| 3 | 129 | 16 | 0.250824 | 0.143799 | -0.155072 | Baseline v1.xlsx | features | 0,3 |
| 4 | 129 | 8 | 0.201481 | 0.148246 | 0.094267 | v1.1.xlsx | features | 0,3 |
| 5 | 129 | 7 | 0.201430 | 0.155231 | 0.099199 | v1.2.xlsx | features | 0,3 |
| 6 | 129 | 16 | 0.257400 | 0.151314 | -0.089552 | Baseline v1.xlsx | features | 0,5 |
| 7 | 129 | 8 | 0.214735 | 0.162384 | 0.121771 | v1.1.xlsx | features | 0,5 |
| 8 | 129 | 7 | 0.214615 | 0.169179 | 0.133885 | v1.2.xlsx | features | 0,5 |
Comparison table (rows = windows | columns = metrics per file):
| window | adjusted_r_squared (Baseline v1.xlsx) | adjusted_r_squared (v1.1.xlsx) | adjusted_r_squared (v1.2.xlsx) | cross_validated_r_squared (Baseline v1.xlsx) | cross_validated_r_squared (v1.1.xlsx) | cross_validated_r_squared (v1.2.xlsx) | r_squared (Baseline v1.xlsx) | r_squared (v1.1.xlsx) | r_squared (v1.2.xlsx) |
|---|---|---|---|---|---|---|---|---|---|
| 0,1 | 0.203983 | 0.194838 | 0.201328 | -0.115372 | 0.068034 | 0.072892 | 0.303485 | 0.245160 | 0.245005 |
| 0,3 | 0.143799 | 0.148246 | 0.155231 | -0.155072 | 0.094267 | 0.099199 | 0.250824 | 0.201481 | 0.201430 |
| 0,5 | 0.151314 | 0.162384 | 0.169179 | -0.089552 | 0.121771 | 0.133885 | 0.257400 | 0.214735 | 0.214615 |
Best per window (by cross-validated R^2):
| | window | features_file | cross_validated_r_squared | adjusted_r_squared | r_squared | rows_used | features_used |
|---|---|---|---|---|---|---|---|
| 0 | 0,1 | v1.2.xlsx | 0.072892 | 0.201328 | 0.245005 | 129 | 7 |
| 1 | 0,3 | v1.2.xlsx | 0.099199 | 0.155231 | 0.201430 | 129 | 7 |
| 2 | 0,5 | v1.2.xlsx | 0.133885 | 0.169179 | 0.214615 | 129 | 7 |
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
 - v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv
 - v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv
 - v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv
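`fit_and_score` applies the standard adjusted R² penalty, adj = 1 − (1 − R²)(n − 1)/(n − p − 1). With n = 129 rows, the 16-feature baseline is penalised much harder than the 7-feature v1.2, which is why their adjusted figures end up close despite the raw gap. A quick check reproducing two entries from the window 0,1 rows above:

```python
# Reproduce the adjusted R^2 entries from the results table using the
# same formula as fit_and_score: adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.303485, 129, 16), 6))  # Baseline v1, window 0,1 -> 0.203983
print(round(adjusted_r2(0.245005, 129, 7), 6))   # v1.2, window 0,1 -> 0.201328
```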
In [9]:
# === Visualise v1 vs v1.1 vs v1.2 on windows 0,1 / 0,3 / 0,5 ===
# Requires: pandas, numpy, scikit-learn, openpyxl, matplotlib
# pip install pandas numpy scikit-learn openpyxl matplotlib
from pathlib import Path
import re, numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
# ----------------- CONFIG -----------------
BASE_DIRS = [Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("."), Path("/mnt/data")]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = {
"Baseline v1.xlsx": "v1",
"v1.1.xlsx": "v1.1",
"v1.2.xlsx": "v1.2",
}
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5
# colours for models (distinct)
MODEL_COLOURS = {"v1":"#1f77b4", "v1.1":"#ff7f0e", "v1.2":"#2ca02c"}
# ----------------- HELPERS -----------------
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(f"Could not find: {name}")
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(x):
n, df = x
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
return out
def find_day0(df):
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if strict: return strict[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_ticker(s): return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g = df.copy()
g["__day0__"] = norm_day0(g[dcol]); g["__tic__"] = norm_ticker(g[tcol])
nums = g.select_dtypes(include=[np.number]).columns.tolist()
g = (g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, cols, ycol):
keep=[c for c in cols if c in merged.columns]
X = merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq = X.nunique(dropna=False)
return X.loc[:, nunq>1]
def cv_r2_and_oof_preds(X, y, groups):
n_groups = int(pd.Series(groups).nunique())
if len(X)<3: return np.nan, np.full(len(y), np.nan)
if n_groups >= 2:
splitter = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups))
splits = splitter.split(X, y, groups=groups)
else:
splitter = KFold(n_splits=min(3, len(X)), shuffle=True, random_state=42)
splits = splitter.split(X, y)
oof = np.full(len(y), np.nan)
scores=[]
for tr,te in splits:
m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
yh = m.predict(X.iloc[te].values)
yt = y.iloc[te].values
oof[te] = yh
ss_res = np.sum((yt - yh)**2)
ss_tot = np.sum((yt - yt.mean())**2)
scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
return float(np.nanmean(scores)), oof
def insample_r2_adj(X, y):
m = LinearRegression().fit(X.values, y.values)
r2 = float(m.score(X.values, y.values))
n,p = len(y), X.shape[1]
adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1)>0 else np.nan
return r2, adj
# ----------------- LOAD -----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
all_rows=[]
oof_store={} # (model, window) -> (y_true, y_oof)
for ffile, tag in [(find_file(k), v) for k,v in FEATURE_FILES.items()]:
f_book = pd.read_excel(ffile, sheet_name=None, engine="openpyxl")
f_sheet = choose_features_sheet(f_book)
raw = f_book[f_sheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, feat_cols = group_numeric(raw, dcol, tcol)
for w in WINDOWS:
es = win_map[w]
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"] = norm_day0(ev[ed]); ev["__tic__"] = norm_ticker(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X = build_X(merged, feat_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
if len(X)==0:
all_rows.append({"model":tag,"window":w,"rows_used":0,"features_used":0,
"r_squared":np.nan,"adjusted_r_squared":np.nan,"cross_validated_r_squared":np.nan})
continue
r2, adj = insample_r2_adj(X, y)
cv, oof = cv_r2_and_oof_preds(X, y, groups)
all_rows.append({"model":tag,"window":w,"rows_used":len(X),"features_used":X.shape[1],
"r_squared":r2,"adjusted_r_squared":adj,"cross_validated_r_squared":cv})
oof_store[(tag,w)] = (y.values, oof)
res = pd.DataFrame(all_rows).sort_values(["window","model"]).reset_index(drop=True)
# Save results
out_dir = find_file(EVENT_FILE).parent
res.to_csv(out_dir/"viz_v1_v1.1_v1.2_results.csv", index=False)
# ----------------- PLOTS -----------------
# Bar charts: cross-validated coefficient of determination and adjusted coefficient of determination
for metric, title in [("cross_validated_r_squared","Cross-validated R^2"),
("adjusted_r_squared","Adjusted R^2")]:
fig = plt.figure(figsize=(8,5))
idx = np.arange(len(WINDOWS))
width = 0.22
offsets = {"v1":-width, "v1.1":0.0, "v1.2":width}
for model in ["v1","v1.1","v1.2"]:
vals = [res.loc[(res.window==w)&(res.model==model), metric].iloc[0] for w in WINDOWS]
plt.bar(idx + offsets[model], vals, width, label=model, color=MODEL_COLOURS[model])
plt.xticks(idx, WINDOWS)
plt.ylabel(title)
plt.title(f"{title} — v1 vs v1.1 vs v1.2")
plt.legend()
plt.tight_layout()
fig.savefig(out_dir/f"{metric}_bars_v1_v11_v12.png", dpi=150)
plt.show()
# Line graph: coefficient of determination across windows
fig = plt.figure(figsize=(8,5))
for model in ["v1","v1.1","v1.2"]:
vals = [res.loc[(res.window==w)&(res.model==model), "r_squared"].iloc[0] for w in WINDOWS]
plt.plot(WINDOWS, vals, marker="o", label=model, color=MODEL_COLOURS[model])
plt.ylabel("R^2")
plt.title("R^2 across windows — v1 vs v1.1 vs v1.2")
plt.legend()
plt.tight_layout()
fig.savefig(out_dir/"r2_lines_v1_v11_v12.png", dpi=150)
plt.show()
# Scatter plots with line of best fit (out-of-fold predictions) for each model and window
def scatter_with_fit(y_true, y_pred, title, save_path):
fig = plt.figure(figsize=(5,5))
plt.scatter(y_true, y_pred, s=18, alpha=0.7)
# best fit line (y_pred on y_true)
ok = np.isfinite(y_true) & np.isfinite(y_pred)
if ok.sum() >= 2:
a,b = np.polyfit(y_true[ok], y_pred[ok], 1)
xs = np.linspace(np.nanmin(y_true[ok]), np.nanmax(y_true[ok]), 100)
plt.plot(xs, a*xs + b, linestyle="--")
# 45-degree reference
lim = np.nanmax(np.abs(np.concatenate([y_true[ok], y_pred[ok]]))) if ok.any() else 1.0
lim = float(lim)*1.05
plt.plot([-lim, lim], [-lim, lim], linestyle=":")
plt.xlim(-lim, lim); plt.ylim(-lim, lim)
plt.xlabel("Actual CAR")
plt.ylabel("Predicted CAR (OOF)")
plt.title(title)
plt.tight_layout()
fig.savefig(save_path, dpi=150)
plt.show()
for model in ["v1","v1.1","v1.2"]:
for w in WINDOWS:
if (model, w) in oof_store:
y_true, y_oof = oof_store[(model,w)]
scatter_with_fit(y_true, y_oof,
title=f"{model} — window {w} (OOF)",
save_path=out_dir/f"scatter_oof_{model}_window_{w.replace(',','_')}.png")
print(f"Saved figures and CSV in: {out_dir}")
Saved figures and CSV in: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
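Filtering `res` down to a single row, as the plotting loops do, still returns a one-row Series, and calling `float()` on a Series is deprecated in pandas. A minimal sketch (toy DataFrame) of extracting the scalar explicitly with `.iloc[0]`:

```python
# Toy illustration of the scalar-extraction pattern the plotting loops need:
# a boolean filter that matches one row yields a length-1 Series, and
# float(one_row_series) is deprecated; select the element before converting.
import pandas as pd

res = pd.DataFrame({"window": ["0,1", "0,3"],
                    "model": ["v1.2", "v1.2"],
                    "r_squared": [0.245005, 0.201430]})

val = res.loc[(res.window == "0,1") & (res.model == "v1.2"), "r_squared"].iloc[0]
print(val)  # 0.245005
```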
In [2]:
# === Feature importance for v1.2 (7 features) with grouped cross validation ===
# Tests: permutation drop, leave-one-feature-out drop, mean abs standardized coefficient
# Join on day0 + ticker; no engineered event columns as predictors
# Saves a CSV per window and prints a sorted table
#
# pip install pandas numpy scikit-learn openpyxl matplotlib
from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# ---------------- CONFIG ----------------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURES_FILE = "v1.2.xlsx" # the reduced model file with the 7 features
WINDOWS_TO_SCORE = ["0,5"] # change to ["0,1","0,3","0,5"] if you want all
MAX_GROUP_FOLDS = 5
FORCE_FEATURES = None # put a list here if you want to force exactly 7 feature names
# -------------- HELPERS --------------
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(x):
_, df = x
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s): return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, cols, ycol):
keep=[c for c in cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def grouped_splits(X, y, groups):
ng = int(pd.Series(groups).nunique())
if ng>=2:
gkf = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, ng))
return list(gkf.split(X, y, groups=groups))
# fallback when groups are too few
k = min(3, len(X))
return list(KFold(n_splits=k, shuffle=True, random_state=42).split(X, y))
def fit_and_score(Xtr, ytr, Xte, yte):
# standardise in train only
scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr.values)
Xte_s = scaler.transform(Xte.values)
m = LinearRegression().fit(Xtr_s, ytr.values)
# test coefficient of determination
yh = m.predict(Xte_s)
ss_res = ((yte.values - yh)**2).sum()
ss_tot = ((yte.values - yte.values.mean())**2).sum()
r2 = 1 - ss_res/ss_tot if ss_tot>0 else np.nan
return r2, m.coef_, scaler
# -------------- LOAD --------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
feat_raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, feat_cols_all = group_numeric(feat_raw, dcol, tcol)
# If you want to force exactly seven features, list them in FORCE_FEATURES
if FORCE_FEATURES:
feat_cols = [c for c in FORCE_FEATURES if c in feat_g.columns]
else:
# use all numeric columns in v1.2 after dropping the target later
feat_cols = feat_cols_all
print("Detected feature candidates:", feat_cols)
results_all = []
for w in WINDOWS_TO_SCORE:
es = win_map[w]
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X = build_X(merged, feat_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
# If more than seven made it through, keep the seven with most variance
if X.shape[1] > 7:
var_rank = X.var().sort_values(ascending=False).index.tolist()
X = X[var_rank[:7]]
feature_list = X.columns.tolist()
# Cross validated baseline and fold objects
splits = grouped_splits(X, y, groups)
# 1) Permutation importance (drop in test coefficient of determination when permuted on test)
perm_drops = {f: [] for f in feature_list}
# 2) Standardized coefficients (mean absolute across folds)
coef_collection = {f: [] for f in feature_list}
# Compute baseline per fold and permutation drops
for tr, te in splits:
r2_base, coef, scaler = fit_and_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
# map coefficients back to feature names (after scaling)
for f, c in zip(feature_list, coef):
coef_collection[f].append(abs(float(c)))
# permutation on the test slice only
Xte = X.iloc[te].copy()
for f in feature_list:
Xperm = Xte.copy()
Xperm[f] = np.random.permutation(Xperm[f].values) # break the link
# the training slice is unchanged here, so refitting reproduces the same model; fit_and_score is simply reused
r2_perm, _, _ = fit_and_score(X.iloc[tr], y.iloc[tr], Xperm, y.iloc[te])
drop = (r2_base - r2_perm) if (np.isfinite(r2_base) and np.isfinite(r2_perm)) else np.nan
perm_drops[f].append(drop)
perm_mean = {f: float(np.nanmean(v)) for f, v in perm_drops.items()}
coef_mean = {f: float(np.nanmean(v)) for f, v in coef_collection.items()}
# 3) Leave-one-feature-out cross validated coefficient of determination drop
# baseline cross validated coefficient of determination with all features
def cv_r2(Xfull):
scores=[]
for tr, te in splits:
r2, _, _ = fit_and_score(Xfull.iloc[tr], y.iloc[tr], Xfull.iloc[te], y.iloc[te])
scores.append(r2)
return float(np.nanmean(scores))
base_cv = cv_r2(X)
lofo_drop = {}
for f in feature_list:
X_minus = X.drop(columns=[f])
lofo_drop[f] = base_cv - cv_r2(X_minus)
# Build importance table
imp = pd.DataFrame({
"feature": feature_list,
"permutation_drop_in_test_coefficient_of_determination": [perm_mean[f] for f in feature_list],
"leave_one_out_drop_in_cross_validated_coefficient_of_determination": [lofo_drop[f] for f in feature_list],
"mean_abs_standardized_coefficient": [coef_mean[f] for f in feature_list],
})
# Ranks (1 = most important)
for col in ["permutation_drop_in_test_coefficient_of_determination",
"leave_one_out_drop_in_cross_validated_coefficient_of_determination",
"mean_abs_standardized_coefficient"]:
imp[f"rank_{col}"] = imp[col].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[[c for c in imp.columns if c.startswith("rank_")]].mean(axis=1)
imp = imp.sort_values("aggregate_rank").reset_index(drop=True)
print(f"\nWindow {w} — baseline cross validated coefficient of determination with all seven: {base_cv:.4f}")
display(imp)
out_path = find_file(EVENT_FILE).parent / f"v12_feature_importance_window_{w.replace(',','_')}.csv"
imp.to_csv(out_path, index=False)
print("Saved:", out_path)
Detected feature candidates: ['eps_surprise_pct', 'pre_ret_3d', 'pre_vol_5d', 'mkt_ret_5d_lag1', 'macro_us10y', 'vix_level_lag1', 'vix_chg_5d_lag1']

Window 0,5 — baseline cross validated coefficient of determination with all seven: 0.1339
| | feature | permutation_drop_in_test_coefficient_of_determination | leave_one_out_drop_in_cross_validated_coefficient_of_determination | mean_abs_standardized_coefficient | rank_permutation_drop_in_test_coefficient_of_determination | rank_leave_one_out_drop_in_cross_validated_coefficient_of_determination | rank_mean_abs_standardized_coefficient | aggregate_rank |
|---|---|---|---|---|---|---|---|---|
| 0 | pre_ret_3d | 0.147469 | 0.149779 | 0.022378 | 1.0 | 1.0 | 1.0 | 1.000000 |
| 1 | eps_surprise_pct | 0.100898 | 0.045600 | 0.017527 | 2.0 | 2.0 | 2.0 | 2.000000 |
| 2 | vix_chg_5d_lag1 | 0.072175 | 0.015763 | 0.017332 | 3.0 | 5.0 | 3.0 | 3.666667 |
| 3 | macro_us10y | 0.026367 | 0.025868 | 0.010407 | 5.0 | 3.0 | 4.0 | 4.000000 |
| 4 | vix_level_lag1 | 0.043007 | 0.021924 | 0.007735 | 4.0 | 4.0 | 5.0 | 4.333333 |
| 5 | mkt_ret_5d_lag1 | 0.021462 | -0.003507 | 0.003803 | 6.0 | 7.0 | 6.0 | 6.333333 |
| 6 | pre_vol_5d | 0.010645 | 0.000146 | 0.003343 | 7.0 | 6.0 | 7.0 | 6.666667 |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v12_feature_importance_window_0_5.csv
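The cell above estimates importance by permuting each feature on the held-out slice and measuring the drop in the coefficient of determination. A minimal self-contained sketch of that idea on synthetic data (a simplified variant that scores one fitted model rather than refitting per fold as the cell does; all names and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)            # drives the target
x_noise = rng.normal(size=n)             # unrelated to the target
y = 2.0 * x_signal + rng.normal(scale=0.1, size=n)

X = np.column_stack([x_signal, x_noise])
model = LinearRegression().fit(X, y)
r2_base = model.score(X, y)

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-target link
    drops.append(r2_base - model.score(Xp, y))
```

Permuting the signal column collapses the fit, so its drop is large; permuting the noise column barely moves the score, which is exactly the contrast the ranking above relies on.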
In [1]:
# ===== Test features_new.csv against event_study.xlsx on windows 0,1 / 0,3 / 0,5 =====
# Requirements: pandas, numpy, scikit-learn, openpyxl, matplotlib
import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt
# ---- Paths (edit if your files live elsewhere) ----
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURES_FILE = DATA_DIR / "features_new.csv" # the new features file under test
WINDOWS = ["0,1","0,3","0,5"] # test these three
MAX_GROUP_FOLDS = 5
# ---- Helpers ----
def is_readme(name: str) -> bool:
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))
def window_sheets(book: dict):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm):
continue
for w, pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
out[w] = nm
return out
def find_day0(df: pd.DataFrame) -> str:
strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
if strict: return strict[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
best, kbest = None, -1
for c in df.columns:
k = pd.to_datetime(df[c], errors="coerce").notna().sum()
if k > kbest: best, kbest = c, k
return best
def find_ticker(df: pd.DataFrame) -> str:
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best, score = None, -1
for c in obj:
s = df[c].astype(str).str.strip()
sc = s.nunique() - 0.1*s.str.len().mean()
if sc > score: best, score = c, sc
return best
def find_target(df: pd.DataFrame) -> str:
c = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
if c: return c[0]
c = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
return c[0] if c else None
def norm_day0(s: pd.Series):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s: pd.Series):
return s.astype(str).str.strip().str.upper()
def group_numeric_by_day0_tic(df: pd.DataFrame, dcol: str, tcol: str):
g = df.copy()
g["__day0__"] = norm_day0(g[dcol])
g["__tic__"] = norm_tic(g[tcol])
nums = g.select_dtypes(include=[np.number]).columns.tolist()
g = (g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged: pd.DataFrame, numeric_cols: list, ycol: str):
keep = [c for c in numeric_cols if c in merged.columns]
X = merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
# drop constants
nunq = X.nunique(dropna=False)
return X.loc[:, nunq > 1]
def adjusted_r2(X: pd.DataFrame, y: pd.Series, r2_value: float):
n, p = len(y), X.shape[1]
if n - p - 1 <= 0:
return np.nan
return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)
def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
if len(X) < 3:
return np.nan
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(max_folds, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
kf = KFold(n_splits=min(3, len(X)), shuffle=True, random_state=42)
splits = kf.split(X, y)
scores = []
for tr, te in splits:
m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
yh = m.predict(X.iloc[te].values)
yt = y.iloc[te].values
ss_res = np.sum((yt - yh)**2)
ss_tot = np.sum((yt - yt.mean())**2)
scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
return float(np.nanmean(scores))
# ---- Load data ----
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
# features_new.csv can have many columns; we will detect day0 and ticker first
feat_raw = pd.read_csv(FEATURES_FILE)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
features_grouped, numeric_cols = group_numeric_by_day0_tic(feat_raw, dcol, tcol)
# ---- Score per window ----
rows = []
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"Could not find a sheet for window {w}. Skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"] = norm_day0(ev[ed])
ev["__tic__"] = norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = features_grouped.merge(ev[["__day0__","__tic__", ycol]],
on=["__day0__","__tic__"], how="inner")
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
rows.append({"model":"features_new.csv","window":w,"rows_used":len(y),
"features_used":X.shape[1],
"r_squared":np.nan,
"adjusted_r_squared":np.nan,
"cross_validated_r_squared":np.nan})
continue
# In-sample
lr = LinearRegression().fit(X.values, y.values)
r2_in = float(lr.score(X.values, y.values))
adj_in = float(adjusted_r2(X, y, r2_in))
# Out-of-sample (grouped)
cv_out = grouped_cv_r2(X, y, groups, MAX_GROUP_FOLDS)
rows.append({
"model":"features_new.csv",
"window": w,
"rows_used": len(y),
"features_used": X.shape[1],
"r_squared": r2_in,
"adjusted_r_squared": adj_in,
"cross_validated_r_squared": cv_out
})
results = pd.DataFrame(rows)
display(results)
# Save
out_csv = DATA_DIR / "features_new_metrics.csv"
results.to_csv(out_csv, index=False)
print("Saved:", out_csv)
# ---- Quick bars ----
def make_bar(metric, title):
fig = plt.figure(figsize=(8,5))
vals = [float(results.loc[results.window==w, metric].iloc[0]) if (results.window==w).any() else np.nan for w in WINDOWS]
plt.bar(range(len(WINDOWS)), vals)
plt.xticks(range(len(WINDOWS)), WINDOWS)
plt.ylabel(title)
plt.title(f"{title} — features_new.csv")
plt.tight_layout()
plt.show()
make_bar("adjusted_r_squared", "Adjusted coefficient of determination")
make_bar("cross_validated_r_squared", "Cross validated coefficient of determination")
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\3626808198.py:63: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning. a = pd.to_datetime(s, errors="coerce").dt.normalize()
| | model | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|
| 0 | features_new.csv | 0,1 | 129 | 27 | 0.832782 | 0.788080 | 0.580178 |
| 1 | features_new.csv | 0,3 | 129 | 27 | 0.743844 | 0.675366 | 0.469311 |
| 2 | features_new.csv | 0,5 | 129 | 27 | 0.705773 | 0.627118 | 0.375565 |
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\3626808198.py:179: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead vals = [float(results.loc[results.window==w, metric]) if (results.window==w).any() else np.nan for w in WINDOWS]
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\features_new_metrics.csv
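The grouped splits used above exist so that every event for a given ticker lands entirely in train or entirely in test, which is what makes the cross validated score out-of-sample per company. A small sketch with made-up tickers showing that `GroupKFold` never lets a group straddle the split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy panel: three tickers with four events each (names are illustrative)
groups = pd.Series(["AAPL"] * 4 + ["MSFT"] * 4 + ["NVDA"] * 4)
X = np.arange(len(groups), dtype=float).reshape(-1, 1)
y = np.zeros(len(groups))

held_out = []
for tr, te in GroupKFold(n_splits=3).split(X, y, groups=groups):
    # A ticker never appears in both train and test of the same fold,
    # so each fold scores the model on companies it has not seen
    assert not (set(groups.iloc[tr]) & set(groups.iloc[te]))
    held_out.append(sorted(set(groups.iloc[te])))
```

With three groups and three splits, each fold holds out exactly one whole ticker.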
In [3]:
# === Which features push cross validated R squared up or down (27-feature set) ===
# Paths
from pathlib import Path
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURES_FILE = DATA_DIR / "features_new.csv" # your file with ~27 features
WINDOWS = ["0,5"] # change to ["0,1","0,3","0,5"] if you want all
# ------------------- Imports -------------------
import re, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression
# ------------------- Small helpers -------------------
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a=pd.to_datetime(s, errors="coerce").dt.normalize()
b=pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric_by_keys(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def adjusted_r2(n, p, r2):
return np.nan if n-p-1<=0 else 1 - (1-r2)*(n-1)/(n-p-1)
# Winsorise and standardise inside train only
def fit_transformers(Xtr):
stats={}
Xw=Xtr.copy()
for c in Xw.columns:
lo, hi = np.nanpercentile(Xw[c].values, [1,99])
stats[c] = {"lo":float(lo), "hi":float(hi),
"mean":float(np.nanmean(Xw[c].clip(lo,hi))),
"std": float(np.nanstd(Xw[c].clip(lo,hi), ddof=0))}
Xw[c] = Xw[c].clip(lo,hi)
sd = stats[c]["std"] or 1.0
Xw[c] = (Xw[c] - stats[c]["mean"]) / sd
return stats, Xw
def apply_transformers(Xte, stats):
Xw=Xte.copy()
for c in Xw.columns:
if c not in stats: continue
lo,hi,mu,sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mean"], stats[c]["std"] or 1.0
Xw[c]=((Xw[c].clip(lo,hi) - mu) / sd)
return Xw
def grouped_splits(X, y, groups, max_folds=5):
ng=int(pd.Series(groups).nunique())
if ng>=2:
return list(GroupKFold(n_splits=min(max_folds, ng)).split(X,y,groups))
k=min(3, len(X))
return list(KFold(n_splits=k, shuffle=True, random_state=42).split(X,y))
def fold_score(Xtr, ytr, Xte, yte):
stats, Xtr_s = fit_transformers(Xtr)
Xte_s = apply_transformers(Xte, stats)
m=LinearRegression().fit(Xtr_s.values, ytr.values)
pred=m.predict(Xte_s.values)
ss_res=np.sum((yte.values - pred)**2)
ss_tot=np.sum((yte.values - yte.values.mean())**2)
return (1 - ss_res/ss_tot) if ss_tot>0 else np.nan, m, stats
def cv_r2(X, y, groups, splits):
scores=[]
for tr,te in splits:
r2,_,_ = fold_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
scores.append(r2)
return float(np.nanmean(scores))
# ------------------- Load -------------------
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
feat_raw = pd.read_csv(FEATURES_FILE)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, numeric_cols = group_numeric_by_keys(feat_raw, dcol, tcol)
# ------------------- Run per window -------------------
all_summaries = []
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"Window {w}: not found in event_study.xlsx; skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
# Build X and y
X = merged[numeric_cols].drop(columns=[c for c in ["CAR","car",ycol] if c in numeric_cols], errors="ignore")
# Drop constant columns
nunq = X.nunique(dropna=False)
X = X.loc[:, nunq > 1]
y = merged[ycol].astype(float)
groups = merged["__tic__"]
# Cross validated baseline
splits = grouped_splits(X, y, groups, max_folds=5)
base_cv = cv_r2(X, y, groups, splits)
# Leave-one-feature-out delta (positive = helpful; negative = harmful)
lofo = {}
for f in X.columns:
cv_without = cv_r2(X.drop(columns=[f]), y, groups, splits)
lofo[f] = base_cv - cv_without
# Permutation drop in test per fold (bigger drop = more important)
perm = {f: [] for f in X.columns}
for tr, te in splits:
r2_base, m, stats = fold_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
if np.isnan(r2_base):
for f in X.columns: perm[f].append(np.nan)
continue
Xte = X.iloc[te].copy()
for f in X.columns:
Xperm = Xte.copy()
Xperm[f] = np.random.permutation(Xperm[f].values)
# score again using same training split
r2_perm, _, _ = fold_score(X.iloc[tr], y.iloc[tr], Xperm, y.iloc[te])
perm[f].append(r2_base - r2_perm)
perm_mean = {f: float(np.nanmean(v)) for f,v in perm.items()}
imp = pd.DataFrame({
"feature": X.columns,
"delta_cv_r_squared_when_dropped": [lofo[f] for f in X.columns],
"permutation_drop_in_test_r_squared": [perm_mean[f] for f in X.columns]
})
# Rankings (1 = most helpful/important)
imp["rank_lofo"] = imp["delta_cv_r_squared_when_dropped"].rank(ascending=False, method="min")
imp["rank_perm"] = imp["permutation_drop_in_test_r_squared"].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[["rank_lofo","rank_perm"]].mean(axis=1)
imp = imp.sort_values("aggregate_rank").reset_index(drop=True)
# Labels for action
imp["action"] = np.where(
imp["delta_cv_r_squared_when_dropped"] < 0,
"candidate_to_drop (model improves when removed)",
"keep_or_review"
)
out_csv = DATA_DIR / f"feature_impact_window_{w.replace(',','_')}.csv"
imp.to_csv(out_csv, index=False)
print(f"\n=== Window {w} ===")
print(f"Baseline cross validated R squared with all features: {base_cv:.4f}")
print("\nTop 10 to KEEP (largest positive delta when dropped and large permutation drop):")
display(imp.sort_values(["delta_cv_r_squared_when_dropped","permutation_drop_in_test_r_squared"], ascending=False).head(10))
print("\nTop 10 to DROP (negative delta when dropped and low/negative permutation drop):")
display(imp.sort_values(["delta_cv_r_squared_when_dropped","permutation_drop_in_test_r_squared"], ascending=[True, True]).head(10))
print("Saved:", out_csv)
all_summaries.append(imp.assign(window=w))
# Combined table (if you ran more than one window)
if all_summaries:
combined = pd.concat(all_summaries, ignore_index=True)
combined_path = DATA_DIR / "feature_impact_all_windows.csv"
combined.to_csv(combined_path, index=False)
print("Combined table saved:", combined_path)
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\795902831.py:57: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning. a=pd.to_datetime(s, errors="coerce").dt.normalize()
=== Window 0,5 ===
Baseline cross validated R squared with all features: 0.3431

Top 10 to KEEP (largest positive delta when dropped and large permutation drop):
| | feature | delta_cv_r_squared_when_dropped | permutation_drop_in_test_r_squared | rank_lofo | rank_perm | aggregate_rank | action |
|---|---|---|---|---|---|---|---|
| 0 | gap_proxy_dm1_to_d0 | 6.613756e-01 | 1.588933 | 1.0 | 3.0 | 2.0 | keep_or_review |
| 7 | pre_ret_3d | 3.735284e-02 | 0.038858 | 2.0 | 17.0 | 9.5 | keep_or_review |
| 6 | mkt_ret_1d_lag1 | 1.763603e-02 | 0.040034 | 3.0 | 16.0 | 9.5 | keep_or_review |
| 13 | pre_ret_10d | 1.043617e-02 | 0.007743 | 4.0 | 21.0 | 12.5 | keep_or_review |
| 15 | vix_level_lag1 | 6.095099e-03 | -0.011713 | 5.0 | 27.0 | 16.0 | keep_or_review |
| 3 | pre_ret_5d | 1.741542e-03 | 0.170232 | 6.0 | 7.0 | 6.5 | keep_or_review |
| 1 | credit_moody_baa_yield_pct | 1.669870e-03 | 13.515042 | 7.0 | 1.0 | 4.0 | keep_or_review |
| 2 | credit_moody_aaa_yield_pct | 1.061318e-03 | 11.991609 | 8.0 | 2.0 | 5.0 | keep_or_review |
| 4 | credit_baa_minus_aaa_bp | 0.000000e+00 | 0.132807 | 9.0 | 8.0 | 8.5 | keep_or_review |
| 5 | credit_investment_grade_option_adjusted_spread... | -1.665335e-16 | 0.132769 | 10.0 | 9.0 | 9.5 | candidate_to_drop (model improves when removed) |
Top 10 to DROP (negative delta when dropped and low/negative permutation drop):
| | feature | delta_cv_r_squared_when_dropped | permutation_drop_in_test_r_squared | rank_lofo | rank_perm | aggregate_rank | action |
|---|---|---|---|---|---|---|---|
| 26 | macro_fedfunds | -0.032414 | 0.021119 | 27.0 | 19.0 | 23.0 | candidate_to_drop (model improves when removed) |
| 23 | mkt_ret_5d_lag1 | -0.031801 | 0.037368 | 26.0 | 18.0 | 22.0 | candidate_to_drop (model improves when removed) |
| 18 | mkt_ret_10d_lag1 | -0.028707 | 0.046340 | 25.0 | 14.0 | 19.5 | candidate_to_drop (model improves when removed) |
| 25 | pre_vol_3d | -0.020214 | 0.005819 | 24.0 | 22.0 | 23.0 | candidate_to_drop (model improves when removed) |
| 21 | eps_surprise_pct | -0.020014 | 0.016920 | 23.0 | 20.0 | 21.5 | candidate_to_drop (model improves when removed) |
| 17 | pre_vol_5d | -0.016368 | 0.050449 | 22.0 | 13.0 | 17.5 | candidate_to_drop (model improves when removed) |
| 16 | macro_cpi_yoy | -0.015780 | 0.128479 | 21.0 | 11.0 | 16.0 | candidate_to_drop (model improves when removed) |
| 24 | vix_chg_10d_lag1 | -0.008765 | 0.000686 | 20.0 | 24.0 | 22.0 | candidate_to_drop (model improves when removed) |
| 20 | vix_chg_5d_lag1 | -0.008057 | 0.001458 | 19.0 | 23.0 | 21.0 | candidate_to_drop (model improves when removed) |
| 22 | pre_vol_10d | -0.007810 | -0.002316 | 18.0 | 25.0 | 21.5 | candidate_to_drop (model improves when removed) |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\feature_impact_window_0_5.csv
Combined table saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\feature_impact_all_windows.csv
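The `fit_transformers` / `apply_transformers` pair in the cell above winsorises each column at the train 1st/99th percentiles and then standardises with train statistics only. A stripped-down sketch of that step on one toy column (values are illustrative):

```python
import numpy as np
import pandas as pd

# One column with a wild outlier; clip to the train 1st/99th percentiles,
# then standardise using train statistics only
train = pd.Series(np.concatenate([np.arange(100.0), [1e6]]))
lo, hi = np.nanpercentile(train.values, [1, 99])
clipped = train.clip(lo, hi)
mu = float(clipped.mean())
sd = float(clipped.std(ddof=0)) or 1.0
train_scaled = (clipped - mu) / sd

# Test data is transformed with the SAME lo/hi/mu/sd, so nothing about
# the test distribution leaks into the fitted transform
test = pd.Series([-50.0, 50.0, 1e6])
test_scaled = (test.clip(lo, hi) - mu) / sd
```

Clipping before computing the mean and standard deviation is what stops a single outlier from dominating the scale, and reusing the train statistics on test keeps the cross validated scores honest.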
In [5]:
# === Compare v2.1.csv vs v2.2.csv vs v2.3.csv on CAR (0,1) (0,3) (0,5) ===
# Requirements: pandas, numpy, scikit-learn, openpyxl, matplotlib
import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt
# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v2.1.csv", "v2.2.csv", "v2.3.csv"] # put more here if needed
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
# ---------- Helpers ----------
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I):
out[w] = nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def adjusted_r2(X, y, r2_value):
n, p = len(y), X.shape[1]
if n-p-1 <= 0: return np.nan
return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)
def grouped_cv_r2(X, y, groups):
n_groups = int(pd.Series(groups).nunique())
if n_groups >= 2:
gkf = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups))
splits = gkf.split(X, y, groups=groups)
else:
splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X,y)
scores=[]
for tr, te in splits:
m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
return float(np.nanmean(scores))
# ---------- Load event study and map windows ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
# ---------- Score each features file on each window ----------
rows = []
for f in FEATURE_FILES:
fpath = find_file(f)
feat_raw = pd.read_csv(fpath)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, numeric_cols = group_numeric(feat_raw, dcol, tcol)
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"[{f}] window {w} sheet not found. Skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":np.nan, "adjusted_r_squared":np.nan,
"cross_validated_r_squared":np.nan})
continue
lr = LinearRegression().fit(X.values, y.values)
r2_in = float(lr.score(X.values, y.values))
adj_in = float(adjusted_r2(X, y, r2_in))
cv_out = grouped_cv_r2(X, y, groups)
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":r2_in, "adjusted_r_squared":adj_in,
"cross_validated_r_squared":cv_out})
# ---------- Results table ----------
results = pd.DataFrame(rows)
results = results.sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)
# Save
out_path = find_file(EVENT_FILE).parent / "v2_models_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)
# ---------- Quick visual: cross validated R squared by window ----------
plt.figure(figsize=(9,5))
for f in FEATURE_FILES:
sub = results[results.model==f]
plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated R squared — v2 models")
plt.xlabel("Window"); plt.ylabel("Cross validated R squared")
plt.legend(); plt.tight_layout(); plt.show()
# Also show the winner per window
best = results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])\
.groupby("window").head(1).reset_index(drop=True)
print("\nBest model per window (by cross validated R squared):")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\1394943956.py:72: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning. a = pd.to_datetime(s, errors="coerce").dt.normalize()
| | model | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|
| 0 | v2.1.csv | 0,1 | 129 | 7 | 0.799498 | 0.787898 | 0.707091 |
| 1 | v2.2.csv | 0,1 | 129 | 6 | 0.779788 | 0.768958 | 0.616111 |
| 2 | v2.3.csv | 0,1 | 129 | 13 | 0.804145 | 0.782005 | 0.575099 |
| 3 | v2.1.csv | 0,3 | 129 | 7 | 0.716792 | 0.700408 | 0.630746 |
| 4 | v2.2.csv | 0,3 | 129 | 6 | 0.709632 | 0.695351 | 0.576169 |
| 5 | v2.3.csv | 0,3 | 129 | 13 | 0.730299 | 0.699811 | 0.541791 |
| 6 | v2.1.csv | 0,5 | 129 | 7 | 0.681290 | 0.662853 | 0.556400 |
| 7 | v2.2.csv | 0,5 | 129 | 6 | 0.666807 | 0.650420 | 0.491290 |
| 8 | v2.3.csv | 0,5 | 129 | 13 | 0.699674 | 0.665724 | 0.480121 |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v2_models_metrics.csv
Best model per window (by cross validated R squared):
| | window | model | cross_validated_r_squared | adjusted_r_squared | rows_used | features_used |
|---|---|---|---|---|---|---|
| 0 | 0,1 | v2.1.csv | 0.707091 | 0.787898 | 129 | 7 |
| 1 | 0,3 | v2.1.csv | 0.630746 | 0.700408 | 129 | 7 |
| 2 | 0,5 | v2.1.csv | 0.556400 | 0.662853 | 129 | 7 |
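The `adjusted_r2` helper used throughout these comparisons applies the usual small-sample penalty for extra regressors; recomputing it by hand on the v2.1.csv window 0,1 row recovers the table value to rounding:

```python
# Adjusted R squared penalises each extra regressor:
#   adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# Checking the v2.1.csv window 0,1 row from the table above
# (n = 129 rows, p = 7 features, in-sample r2 ≈ 0.799498)
n, p, r2 = 129, 7, 0.799498
adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

This is why v2.3.csv, with 13 features, can post a higher raw r_squared than v2.1.csv yet a lower adjusted figure.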
In [7]:
# === Compare v2.1.1.csv vs v2.1.csv on CAR_(0,1)/(0,3)/(0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib
import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt
# ----- Paths (edit if needed) -----
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v2.1.1.csv", "v2.1.csv"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
# ----- Helpers -----
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def adjusted_r2(X, y, r2_value):
n, p = len(y), X.shape[1]
if n-p-1 <= 0: return np.nan
return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)
def grouped_cv_r2(X, y, groups, max_folds=5):
n_groups = int(pd.Series(groups).nunique())
if len(X) < 3: return np.nan
if n_groups >= 2:
splits = GroupKFold(n_splits=min(max_folds, n_groups)).split(X, y, groups=groups)
else:
splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y)
scores=[]
for tr, te in splits:
m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
return float(np.nanmean(scores))
# ----- Load event study -----
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
# ----- Score both files -----
rows = []
for f in FEATURE_FILES:
fpath = find_file(f)
feat_raw = pd.read_csv(fpath)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, numeric_cols = group_numeric(feat_raw, dcol, tcol)
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"[{f}] window {w} sheet not found. Skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":np.nan, "adjusted_r_squared":np.nan,
"cross_validated_r_squared":np.nan})
continue
lr = LinearRegression().fit(X.values, y.values)
r2_in = float(lr.score(X.values, y.values))
adj_in = float(adjusted_r2(X, y, r2_in))
cv_out = grouped_cv_r2(X, y, groups, MAX_GROUP_FOLDS)
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":r2_in, "adjusted_r_squared":adj_in,
"cross_validated_r_squared":cv_out})
# ----- Results table -----
results = pd.DataFrame(rows).sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)
# Save
out_path = find_file(EVENT_FILE).parent / "v2_1_1_vs_v2_1_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)
# ----- Quick plot: cross validated R^2 by window -----
plt.figure(figsize=(8,5))
for f in FEATURE_FILES:
sub = results[results.model==f]
plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated R squared — v2.1.1 vs v2.1")
plt.xlabel("Window"); plt.ylabel("Cross validated R squared")
plt.legend(); plt.tight_layout(); plt.show()
# Winner per window
best = results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False]).groupby("window").head(1)
print("\nBest per window:")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\832927390.py:70: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
| | model | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|
| 0 | v2.1.csv | 0,1 | 129 | 7 | 0.799498 | 0.787898 | 0.707091 |
| 1 | v2.1.1.csv | 0,1 | 129 | 6 | 0.185055 | 0.144976 | 0.042623 |
| 2 | v2.1.csv | 0,3 | 129 | 7 | 0.716792 | 0.700408 | 0.630746 |
| 3 | v2.1.1.csv | 0,3 | 129 | 6 | 0.142172 | 0.099984 | 0.046614 |
| 4 | v2.1.csv | 0,5 | 129 | 7 | 0.681290 | 0.662853 | 0.556400 |
| 5 | v2.1.1.csv | 0,5 | 129 | 6 | 0.149269 | 0.107430 | 0.073914 |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v2_1_1_vs_v2_1_metrics.csv
Best per window:
| | window | model | cross_validated_r_squared | adjusted_r_squared | rows_used | features_used |
|---|---|---|---|---|---|---|
| 0 | 0,1 | v2.1.csv | 0.707091 | 0.787898 | 129 | 7 |
| 2 | 0,3 | v2.1.csv | 0.630746 | 0.700408 | 129 | 7 |
| 4 | 0,5 | v2.1.csv | 0.556400 | 0.662853 | 129 | 7 |
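The adjusted figures in the tables above come from the `adjusted_r2` helper, which applies the standard penalty 1 - (1 - R²)(n - 1)/(n - p - 1). A quick standalone check (re-implementing the helper so it runs on its own; the inputs are the v2.1 window 0,1 row, which it reproduces up to rounding):

```python
def adjusted_r2(n, p, r2):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1);
    # undefined when there are too few rows for the feature count.
    if n - p - 1 <= 0:
        return float("nan")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n=129 rows, p=7 features, in-sample R^2 = 0.799498
print(round(adjusted_r2(129, 7, 0.799498), 6))  # → 0.787899
```

With n fixed at 129, the penalty is mild; it matters more for the AAPL-only subsets later in this notebook, where n drops to 43.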
In [1]:
# === Compare v1.2.xlsx vs v1.3.xlsx on CAR (0,1) (0,3) (0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib
import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt
# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v1.2.xlsx", "v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
# ---------- Helpers ----------
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
return out
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(x):
_, df = x
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def adjusted_r2(X, y, r2_value):
n, p = len(y), X.shape[1]
if n-p-1 <= 0: return np.nan
return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)
def grouped_cv_r2(X, y, groups):
n_groups = int(pd.Series(groups).nunique())
if len(X) < 3: return np.nan
if n_groups >= 2:
splits = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups)).split(X, y, groups=groups)
else:
splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y)
scores=[]
for tr, te in splits:
m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
return float(np.nanmean(scores))
# ---------- Load event study and map windows ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
# ---------- Score each file on each window ----------
rows = []
for f in FEATURE_FILES:
fpath = find_file(f)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"[{f}] window {w} sheet not found. Skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":np.nan, "adjusted_r_squared":np.nan,
"cross_validated_r_squared":np.nan})
continue
lr = LinearRegression().fit(X.values, y.values)
r2_in = float(lr.score(X.values, y.values))
adj_in = float(adjusted_r2(X, y, r2_in))
cv_out = grouped_cv_r2(X, y, groups)
rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
"r_squared":r2_in, "adjusted_r_squared":adj_in,
"cross_validated_r_squared":cv_out})
# ---------- Results table ----------
results = pd.DataFrame(rows)
results = results.sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)
# Save
out_path = find_file(EVENT_FILE).parent / "v1_2_vs_v1_3_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)
# ---------- Quick plot: cross validated coefficient of determination by window ----------
plt.figure(figsize=(8,5))
for f in FEATURE_FILES:
sub = results[results.model==f]
plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated coefficient of determination — v1.2 vs v1.3")
plt.xlabel("Window"); plt.ylabel("Cross validated coefficient of determination")
plt.legend(); plt.tight_layout(); plt.show()
# Winner per window
best = (results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
.groupby("window").head(1).reset_index(drop=True))
print("\nBest per window:")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
| | model | window | rows_used | features_used | r_squared | adjusted_r_squared | cross_validated_r_squared |
|---|---|---|---|---|---|---|---|
| 0 | v1.2.xlsx | 0,1 | 129 | 7 | 0.245005 | 0.201328 | 0.072892 |
| 1 | v1.3.xlsx | 0,1 | 129 | 2 | 0.148360 | 0.134842 | -0.018751 |
| 2 | v1.2.xlsx | 0,3 | 129 | 7 | 0.201430 | 0.155231 | 0.099199 |
| 3 | v1.3.xlsx | 0,3 | 129 | 2 | 0.103903 | 0.089679 | -0.002105 |
| 4 | v1.2.xlsx | 0,5 | 129 | 7 | 0.214615 | 0.169179 | 0.133885 |
| 5 | v1.3.xlsx | 0,5 | 129 | 2 | 0.116031 | 0.102000 | 0.043618 |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v1_2_vs_v1_3_metrics.csv
Best per window:
| | window | model | cross_validated_r_squared | adjusted_r_squared | rows_used | features_used |
|---|---|---|---|---|---|---|
| 0 | 0,1 | v1.2.xlsx | 0.072892 | 0.201328 | 129 | 7 |
| 1 | 0,3 | v1.2.xlsx | 0.099199 | 0.155231 | 129 | 7 |
| 2 | 0,5 | v1.2.xlsx | 0.133885 | 0.169179 | 129 | 7 |
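A note on reading the negative values for v1.3 above: `grouped_cv_r2` scores each held-out fold manually as 1 - SS_res/SS_tot, so a model that predicts worse than simply using the held-out mean scores below zero. A minimal standalone illustration with made-up numbers:

```python
import numpy as np

def oos_r2(y_true, y_pred):
    # Same per-fold formula as grouped_cv_r2: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot if ss_tot > 0 else np.nan

y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(oos_r2(y_true, y_true))                           # perfect fit  → 1.0
print(oos_r2(y_true, np.full(4, y_true.mean())))        # mean guess   → 0.0
print(oos_r2(y_true, np.array([4.0, 3.0, 2.0, 1.0])))   # worse than mean → -3.0
```

Unlike in-sample R², the out-of-sample version has no lower bound, which is why a small negative value (as for v1.3 on 0,1 and 0,3) reads as "no better than the mean" rather than as a tiny positive effect.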
In [7]:
# === FIXED: Time-aware cross validation for v1.2.xlsx vs v1.3.xlsx (0,1 / 0,3 / 0,5) ===
# Safe when some windows produce no valid time splits.
from pathlib import Path
import re, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# ---------------- CONFIG ----------------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v1.2.xlsx", "v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MIN_TRAIN_QUARTERS = 4
BLOCK_SAME_TICKERS = True
WINSOR_PCTS = (1, 99)
SAVE_PLOTS = True
# -------------- HELPERS --------------
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w, pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
out[w] = nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def to_quarter(s):
d = pd.to_datetime(s, errors="coerce")
return d.dt.to_period("Q").astype(str)
def fit_transformers(Xtr, lo=1, hi=99):
stats={}
Xw=Xtr.copy()
for c in Xw.columns:
lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
clamped = Xw[c].clip(lo_v, hi_v)
mu = float(np.nanmean(clamped))
sd = float(np.nanstd(clamped, ddof=0)) or 1.0
stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
Xw[c] = (clamped - mu) / sd
return stats, Xw
def apply_transformers(Xte, stats):
Xw=Xte.copy()
for c in Xw.columns:
if c not in stats: continue
lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
return Xw
def test_r2_on_split(Xtr, ytr, Xte, yte):
stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
Xte_s = apply_transformers(Xte, stats)
m = LinearRegression().fit(Xtr_s.values, ytr.values)
yh = m.predict(Xte_s.values)
ss_res = np.sum((yte.values - yh)**2)
ss_tot = np.sum((yte.values - yte.values.mean())**2)
return (1 - ss_res/ss_tot) if ss_tot>0 else np.nan
def time_splits(df, min_train_quarters=4):
q = to_quarter(df["__day0__"])
uniq = pd.Index(q.unique()).sort_values()
splits=[]
for k in range(min_train_quarters, len(uniq)):
train_q = set(uniq[:k])
test_q = {uniq[k]}
tr_idx = q.isin(train_q).values
te_idx = q.isin(test_q).values
splits.append((np.where(tr_idx)[0], np.where(te_idx)[0], uniq[k]))
return splits
# ---------------- LOAD ----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
out_dir = find_file(EVENT_FILE).parent
all_rows = []
per_split_rows = []
for f in FEATURE_FILES:
fpath = find_file(f)
feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)
for w in WINDOWS:
es = win_map[w]
if es is None:
print(f"[{f}] window {w} sheet not found. Skipping.")
continue
ev = evt_book[es].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner").copy()
if merged.empty:
all_rows.append({"model":f, "window":w, "splits":0, "mean_oos_coefficient_of_determination":np.nan,
"median_oos_coefficient_of_determination":np.nan, "rows_used":0, "features_used":0})
continue
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
merged = merged.assign(__q__ = to_quarter(merged["__day0__"]))
splits = time_splits(merged, min_train_quarters=MIN_TRAIN_QUARTERS)
split_scores=[]
for tr_idx, te_idx, test_q in splits:
Xtr, ytr = X.iloc[tr_idx], y.iloc[tr_idx]
Xte, yte = X.iloc[te_idx], y.iloc[te_idx]
if BLOCK_SAME_TICKERS:
te_tics = set(merged.iloc[te_idx]["__tic__"])
keep_tr = ~merged.iloc[tr_idx]["__tic__"].isin(te_tics).values
Xtr, ytr = Xtr.iloc[keep_tr], ytr.iloc[keep_tr]
if len(ytr) < X.shape[1] + 2:
continue
if len(ytr)==0 or len(yte)==0:
continue
r2 = test_r2_on_split(Xtr, ytr, Xte, yte)
split_scores.append((str(test_q), float(r2), len(yte)))
per_split_rows.append({
"model": f, "window": w, "test_quarter": str(test_q),
"test_rows": len(yte), "oos_coefficient_of_determination": float(r2)
})
if split_scores:
scores = [s[1] for s in split_scores if np.isfinite(s[1])]
mean_oos = float(np.nanmean(scores)) if scores else np.nan
median_oos = float(np.nanmedian(scores)) if scores else np.nan
all_rows.append({
"model": f, "window": w, "splits": len(split_scores),
"mean_oos_coefficient_of_determination": mean_oos,
"median_oos_coefficient_of_determination": median_oos,
"rows_used": len(X), "features_used": X.shape[1]
})
else:
all_rows.append({
"model": f, "window": w, "splits": 0,
"mean_oos_coefficient_of_determination": np.nan,
"median_oos_coefficient_of_determination": np.nan,
"rows_used": len(X), "features_used": X.shape[1]
})
# ---------------- SAVE RESULTS (robust to empty) ----------------
summary = pd.DataFrame(all_rows).sort_values(
["window","mean_oos_coefficient_of_determination"],
ascending=[True, False]
).reset_index(drop=True)
per_split_cols = ["model","window","test_quarter","test_rows","oos_coefficient_of_determination"]
per_split = pd.DataFrame(per_split_rows, columns=per_split_cols)
if not per_split.empty:
per_split = per_split.sort_values(["window","test_quarter","model"]).reset_index(drop=True)
sum_path = out_dir / "time_cv_v12_vs_v13_summary.csv"
split_path = out_dir / "time_cv_v12_vs_v13_per_quarter.csv"
summary.to_csv(sum_path, index=False)
per_split.to_csv(split_path, index=False)
print("Saved:", sum_path)
print("Saved:", split_path)
# ---------------- PLOTS (only if we have rows) ----------------
if SAVE_PLOTS and not per_split.empty:
for w in WINDOWS:
sub = per_split[per_split.window==w]
if sub.empty:
continue
fig = plt.figure(figsize=(9,5))
for mdl, g in sub.groupby("model"):
qorder = pd.Index(g["test_quarter"].unique()).sort_values()
order_map = {q:i for i,q in enumerate(qorder)}
gg = g.copy()
gg["__ord__"] = gg["test_quarter"].map(order_map)
gg = gg.sort_values("__ord__")
plt.plot(gg["__ord__"], gg["oos_coefficient_of_determination"], marker="o", label=mdl)
plt.title(f"Time-aware out-of-sample coefficient of determination by quarter — window {w}")
plt.xlabel("Quarter (ordered)"); plt.ylabel("Out-of-sample coefficient of determination")
plt.legend(); plt.tight_layout()
png_path = out_dir / f"time_cv_oos_r2_window_{w.replace(',','_')}.png"
plt.savefig(png_path, dpi=150)
plt.show()
print("Saved:", png_path)
# ---------------- PRINT SUMMARY ----------------
print("\nTime-aware cross validation — mean out-of-sample coefficient of determination")
display(summary)
if per_split.empty:
print("Note: No valid per-quarter splits were created (likely not enough history or all splits were skipped after ticker blocking).")
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\time_cv_v12_vs_v13_summary.csv
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\time_cv_v12_vs_v13_per_quarter.csv
Time-aware cross validation — mean out-of-sample coefficient of determination
| | model | window | splits | mean_oos_coefficient_of_determination | median_oos_coefficient_of_determination | rows_used | features_used |
|---|---|---|---|---|---|---|---|
| 0 | v1.2.xlsx | 0,1 | 0 | NaN | NaN | 129 | 7 |
| 1 | v1.3.xlsx | 0,1 | 0 | NaN | NaN | 129 | 2 |
| 2 | v1.2.xlsx | 0,3 | 0 | NaN | NaN | 129 | 7 |
| 3 | v1.3.xlsx | 0,3 | 0 | NaN | NaN | 129 | 2 |
| 4 | v1.2.xlsx | 0,5 | 0 | NaN | NaN | 129 | 7 |
| 5 | v1.3.xlsx | 0,5 | 0 | NaN | NaN | 129 | 2 |
Note: No valid per-quarter splits were created (likely not enough history or all splits were skipped after ticker blocking).
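The empty result is a consequence of the split construction: `group_numeric` leaves one row per (day0, ticker) pair, events arrive roughly once per quarter per ticker, and `BLOCK_SAME_TICKERS` then strips every training row that shares a ticker with the test quarter, so nearly all splits fail the `len(ytr) < n_features + 2` guard. The quarter bucketing that drives the splits can be checked in isolation (toy dates, not the project data):

```python
import pandas as pd

def to_quarter(s):
    # Same helper as above: collapse event dates to calendar quarters,
    # which become the units of the time-aware splits.
    return pd.to_datetime(s, errors="coerce").dt.to_period("Q").astype(str)

dates = pd.Series(["2023-01-15", "2023-04-20", "2023-07-18", "2023-10-19"])
print(to_quarter(dates).tolist())  # → ['2023Q1', '2023Q2', '2023Q3', '2023Q4']
```

With `MIN_TRAIN_QUARTERS = 4`, at least five distinct quarters must survive the merge before a single one-quarter test split can even be formed, which is another reason sparse windows produce zero splits.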
In [10]:
# === AAPL feature importance for v1.2 (window 0,5) with time-aware 4-quarter test blocks ===
# Fixes the NaN issue by using multi-quarter test blocks and pooled OOS R^2.
# Joins on day0 + ticker, winsorises/standardises on train only.
from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression
# ---------------- CONFIG ----------------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURES_FILE = "v1.2.xlsx"
WINDOW = "0,5"
TICKER = "AAPL"
MIN_TRAIN_QUARTERS = 4 # expanding train must include at least this many quarters
TEST_BLOCK_QUARTERS = 4 # test on N consecutive quarters to get >=2 points per test
STEP_QUARTERS = 1 # slide test window by this many quarters
MIN_TEST_SIZE = 2 # require at least this many rows in a test block
WINSOR_PCTS = (1, 99)
np.random.seed(42)
# ---------------- HELPERS ----------------
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w, pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
out[w] = nm
return out
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
if c in df.columns: return c
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def to_quarter(s):
d = pd.to_datetime(s, errors="coerce")
return d.dt.to_period("Q").astype(str)
def fit_transformers(Xtr, lo=1, hi=99):
stats={}
Xw=Xtr.copy()
for c in Xw.columns:
lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
clamped = Xw[c].clip(lo_v, hi_v)
mu = float(np.nanmean(clamped))
sd = float(np.nanstd(clamped, ddof=0)) or 1.0
stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
Xw[c] = (clamped - mu) / sd
return stats, Xw
def apply_transformers(Xte, stats):
Xw=Xte.copy()
for c in Xw.columns:
if c not in stats: continue
lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
return Xw
def predict_fold(Xtr, ytr, Xte, yte):
stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
Xte_s = apply_transformers(Xte, stats)
m = LinearRegression().fit(Xtr_s.values, ytr.values)
yh = m.predict(Xte_s.values)
return yh, m.coef_
def pooled_oos_r2(y_true_all, y_pred_all):
yt = np.asarray(y_true_all)
yp = np.asarray(y_pred_all)
ss_res = np.sum((yt - yp)**2)
ss_tot = np.sum((yt - yt.mean())**2)
return float(1 - ss_res/ss_tot) if ss_tot > 0 else np.nan
def build_time_blocks(df, min_train_q=4, test_block_q=4, step_q=1):
    q = to_quarter(df["__day0__"])
    uniq = pd.Index(q.unique()).sort_values()
    blocks=[]
    # Note: reassigning the loop variable inside a Python for-loop does not
    # advance the iteration, so the stride must go into range() directly
    # (the previous `start += step_q - 1` at the bottom of the loop was a no-op).
    for start in range(min_train_q, len(uniq) - test_block_q + 1, step_q):
        train_q = set(uniq[:start])                    # expanding train window
        test_q = set(uniq[start:start+test_block_q])   # N consecutive test quarters
        tr_idx = q.isin(train_q).values
        te_idx = q.isin(test_q).values
        if tr_idx.sum() >= 1 and te_idx.sum() >= MIN_TEST_SIZE:
            blocks.append((np.where(tr_idx)[0], np.where(te_idx)[0],
                           f"{uniq[start]}..{uniq[start+test_block_q-1]}"))
    return blocks
# ---------------- LOAD AAPL DATA ----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
sheet = win_map.get(WINDOW)
assert sheet is not None, f"Could not find event sheet for window {WINDOW}"
ev = evt_book[sheet].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
ev = ev[ev["__tic__"] == TICKER]
feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)
feat_g = feat_g[feat_g["__tic__"] == TICKER]
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner").copy()
assert not merged.empty, "No AAPL rows after merge. Check keys/values."
# Build X,y
X_full = build_X(merged, numeric_cols, ycol)
nunq = X_full.nunique(dropna=False)
X_full = X_full.loc[:, nunq > 1]
y_full = merged[ycol].astype(float)
# Diagnostics
q = to_quarter(merged["__day0__"])
print(f"AAPL rows: {len(X_full)} | features: {X_full.shape[1]} | unique quarters: {q.nunique()}")
# Build time-aware blocks
merged = merged.assign(__q__ = q)
blocks = build_time_blocks(merged, min_train_q=MIN_TRAIN_QUARTERS,
test_block_q=TEST_BLOCK_QUARTERS, step_q=STEP_QUARTERS)
assert len(blocks) > 0, "No valid time blocks. Reduce TEST_BLOCK_QUARTERS or MIN_TRAIN_QUARTERS."
# -------- Baseline pooled OOS predictions (all features) --------
y_true_all, y_pred_all = [], []
coef_abs = {f: [] for f in X_full.columns}
for tr, te, label in blocks:
Xtr, ytr = X_full.iloc[tr], y_full.iloc[tr]
Xte, yte = X_full.iloc[te], y_full.iloc[te]
yh, coef = predict_fold(Xtr, ytr, Xte, yte)
y_true_all.extend(yte.tolist())
y_pred_all.extend(yh.tolist())
for f,c in zip(X_full.columns, coef):
coef_abs[f].append(abs(float(c)))
base_oos_r2 = pooled_oos_r2(y_true_all, y_pred_all)
print(f"Baseline pooled OOS R^2 (all features): {base_oos_r2:.4f}")
# -------- Leave-one-feature-out (LOFO) pooled OOS R^2 deltas --------
lofo_delta = {}
for fdrop in X_full.columns:
y_true_all, y_pred_all = [], []
Xm = X_full.drop(columns=[fdrop])
for tr, te, label in blocks:
Xtr, ytr = Xm.iloc[tr], y_full.iloc[tr]
Xte, yte = Xm.iloc[te], y_full.iloc[te]
yh, _ = predict_fold(Xtr, ytr, Xte, yte)
y_true_all.extend(yte.tolist())
y_pred_all.extend(yh.tolist())
oos = pooled_oos_r2(y_true_all, y_pred_all)
lofo_delta[fdrop] = base_oos_r2 - oos # positive = helpful; negative = harmful
# -------- Permutation importance over test blocks --------
perm_drop = {f: [] for f in X_full.columns}
for tr, te, label in blocks:
# train baseline on this block
Xtr, ytr = X_full.iloc[tr], y_full.iloc[tr]
Xte, yte = X_full.iloc[te], y_full.iloc[te]
yh_base, _ = predict_fold(Xtr, ytr, Xte, yte)
r2_base = pooled_oos_r2(yte.values, np.asarray(yh_base))
for f in X_full.columns:
Xperm = Xte.copy()
Xperm[f] = np.random.permutation(Xperm[f].values) # permute within block
yh_perm, _ = predict_fold(Xtr, ytr, Xperm, yte)
r2_perm = pooled_oos_r2(yte.values, np.asarray(yh_perm))
drop = (r2_base - r2_perm) if (np.isfinite(r2_base) and np.isfinite(r2_perm)) else np.nan
perm_drop[f].append(drop)
# -------- Assemble importance table --------
imp = pd.DataFrame({
"feature": list(X_full.columns),
"lofo_delta_oos_r2": [lofo_delta[f] for f in X_full.columns],
"perm_drop_in_test_r2": [float(np.nanmean(perm_drop[f])) for f in X_full.columns],
"mean_abs_std_coef": [float(np.nanmean(coef_abs[f])) for f in X_full.columns],
})
# Ranks (1 = most important)
imp["rank_lofo"] = imp["lofo_delta_oos_r2"].rank(ascending=False, method="min")
imp["rank_perm"] = imp["perm_drop_in_test_r2"].rank(ascending=False, method="min")
imp["rank_coef"] = imp["mean_abs_std_coef"].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[["rank_lofo","rank_perm","rank_coef"]].mean(axis=1)
imp = imp.sort_values("aggregate_rank").reset_index(drop=True)
# Save
out_dir = find_file(EVENT_FILE).parent
out_csv = out_dir / f"aapl_feature_importance_v12_window_{WINDOW.replace(',','_')}_testblock{TEST_BLOCK_QUARTERS}.csv"
imp.to_csv(out_csv, index=False)
display(imp)
print("Saved:", out_csv)
AAPL rows: 43 | features: 7 | unique quarters: 43
Baseline pooled OOS R^2 (all features): -1.6197
| feature | lofo_delta_oos_r2 | perm_drop_in_test_r2 | mean_abs_std_coef | rank_lofo | rank_perm | rank_coef | aggregate_rank | |
|---|---|---|---|---|---|---|---|---|
| 0 | pre_vol_5d | -0.360582 | 6.514350 | 0.023092 | 4.0 | 1.0 | 2.0 | 2.333333 |
| 1 | eps_surprise_pct | -0.155074 | 1.411248 | 0.021375 | 3.0 | 2.0 | 3.0 | 2.666667 |
| 2 | pre_ret_3d | 0.460483 | -0.172005 | 0.025275 | 1.0 | 7.0 | 1.0 | 3.000000 |
| 3 | mkt_ret_5d_lag1 | 0.238020 | 0.922156 | 0.020129 | 2.0 | 3.0 | 4.0 | 3.000000 |
| 4 | vix_level_lag1 | -0.430675 | 0.627245 | 0.017450 | 6.0 | 4.0 | 5.0 | 5.000000 |
| 5 | vix_chg_5d_lag1 | -0.385391 | 0.200592 | 0.014840 | 5.0 | 6.0 | 6.0 | 5.666667 |
| 6 | macro_us10y | -0.629730 | 0.219502 | 0.014301 | 7.0 | 5.0 | 7.0 | 6.333333 |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\aapl_feature_importance_v12_window_0_5_testblock4.csv
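The script reports a pooled OOS R², computed over all test-block predictions at once rather than averaged fold by fold. A minimal standalone sketch of that statistic (this mirrors what the `pooled_oos_r2` helper defined earlier in the notebook is assumed to do):

```python
import numpy as np

def pooled_oos_r2(y_true, y_pred):
    """R^2 over all out-of-sample pairs at once.

    Pooling uses a single mean of y_true across blocks, so one badly
    predicted block can drag the statistic below zero even when the
    other blocks score well -- unlike averaging per-block R^2 values.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else np.nan

# Two toy test blocks: the second one is predicted poorly.
y_true = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_pred = [1.1, 1.9, 3.2, 6.0, 7.0, 8.0]
print(round(pooled_oos_r2(y_true, y_pred), 4))  # → 0.6171
```

Because errors are pooled against one global mean, a few badly predicted blocks can push the figure far below zero, which helps explain strongly negative baselines like the -1.6197 above.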
In [3]:
# === Baseline v1 feature importance on CAR(0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib (optional)
import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression
# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
Path("/mnt/data"),
Path(".")
]
EVENT_FILE = "event_study.xlsx" # targets live here (CAR sheets)
FEATURES_FILE = "Baseline v1.xlsx" # your baseline features
WINDOW = "0,5" # focus window
MAX_GROUP_FOLDS = 5
WINSOR_PCTS = (1, 99)
np.random.seed(42)
def find_file(name):
for b in BASE_DIRS:
p = b / name
if p.exists(): return p
raise FileNotFoundError(name)
def is_readme(name):
return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))
def window_sheets(book):
out = {"0,1":None,"0,3":None,"0,5":None}
pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
for nm in book:
if is_readme(nm): continue
for w,pat in pats.items():
if out[w] is None and re.search(pat, str(nm), re.I):
out[w]=nm
return out
def choose_features_sheet(book):
cands = [(n, df) for n, df in book.items() if not is_readme(n)]
if not cands: return next(iter(book))
def score(item):
_, df = item
return (df.select_dtypes(include=[np.number]).shape[1], len(df))
return max(cands, key=score)[0]
def find_day0(df):
s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
if s: return s[0]
for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
"date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
if c in df.columns: return c
# fallback: most date-like
best,k=None,-1
for c in df.columns:
kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
if kk>k: best,k=c,kk
return best
def find_ticker(df):
for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
if c in df.columns: return c
obj = df.select_dtypes(include=["object"]).columns
best,score=None,-1
for c in obj:
s=df[c].astype(str).str.strip()
sc=s.nunique() - 0.1*s.str.len().mean()
if sc>score: best,score=c,sc
return best
def find_target(df):
c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
if c: return c[0]
c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
return c[0] if c else None
def norm_day0(s):
a = pd.to_datetime(s, errors="coerce").dt.normalize()
b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
return b.where(b.notna(), a)
def norm_tic(s):
return s.astype(str).str.strip().str.upper()
def group_numeric(df, dcol, tcol):
g=df.copy()
g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
nums=g.select_dtypes(include=[np.number]).columns.tolist()
g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
.dropna(subset=["__day0__","__tic__"]))
return g, nums
def build_X(merged, numeric_cols, ycol):
keep=[c for c in numeric_cols if c in merged.columns]
X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
nunq=X.nunique(dropna=False)
return X.loc[:, nunq>1]
def adjusted_r2(n, p, r2):
return np.nan if n-p-1<=0 else 1 - (1-r2)*(n-1)/(n-p-1)
# train-only winsor + standardise
def fit_transformers(Xtr, lo=1, hi=99):
stats={}
Xw=Xtr.copy()
for c in Xw.columns:
lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
clamped = Xw[c].clip(lo_v, hi_v)
mu = float(np.nanmean(clamped))
sd = float(np.nanstd(clamped, ddof=0)) or 1.0
stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
Xw[c] = (clamped - mu) / sd
return stats, Xw
def apply_transformers(Xte, stats):
Xw=Xte.copy()
for c in Xw.columns:
if c not in stats: continue
lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
return Xw
def fold_score_and_coefs(Xtr, ytr, Xte, yte):
stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
Xte_s = apply_transformers(Xte, stats)
m = LinearRegression().fit(Xtr_s.values, ytr.values)
yh = m.predict(Xte_s.values)
ss_res = np.sum((yte.values - yh)**2)
ss_tot = np.sum((yte.values - yte.values.mean())**2)
r2 = (1 - ss_res/ss_tot) if ss_tot>0 else np.nan
return r2, m.coef_
def grouped_splits(X, y, groups, max_folds=5):
ng=int(pd.Series(groups).nunique())
if ng>=2:
return list(GroupKFold(n_splits=min(max_folds, ng)).split(X, y, groups))
return list(KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y))
# ---------- Load data ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
sheet = win_map.get(WINDOW)
assert sheet is not None, f"Could not find CAR sheet for window {WINDOW}"
ev = evt_book[sheet].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)
merged = feat_g.merge(ev[["__day0__","__tic__", ycol]],
on=["__day0__","__tic__"], how="inner").copy()
# Build design
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]
# Baseline CV R^2 with all features
splits = grouped_splits(X, y, groups, MAX_GROUP_FOLDS)
base_scores, coef_abs = [], {f: [] for f in X.columns}
for tr, te in splits:
r2, coef = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
base_scores.append(r2)
for f,c in zip(X.columns, coef):
coef_abs[f].append(abs(float(c)))
base_cv_r2 = float(np.nanmean(base_scores)) if base_scores else np.nan
coef_mean = {f: float(np.nanmean(v)) for f,v in coef_abs.items()}
# LOFO Δ CV-R^2
lofo = {}
for f in X.columns:
scores=[]
Xm = X.drop(columns=[f])
for tr, te in splits:
r2, _ = fold_score_and_coefs(Xm.iloc[tr], y.iloc[tr], Xm.iloc[te], y.iloc[te])
scores.append(r2)
cv_without = float(np.nanmean(scores)) if scores else np.nan
lofo[f] = base_cv_r2 - cv_without # + = helpful; - = harmful
# Permutation drop (average over folds)
perm = {f: [] for f in X.columns}
for tr, te in splits:
r2_base, _ = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
if not np.isfinite(r2_base):
for f in X.columns: perm[f].append(np.nan)
continue
Xte = X.iloc[te].copy()
for f in X.columns:
Xp = Xte.copy()
Xp[f] = np.random.permutation(Xp[f].values)
r2_perm, _ = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], Xp, y.iloc[te])
perm[f].append(r2_base - r2_perm if np.isfinite(r2_perm) else np.nan)
perm_mean = {f: float(np.nanmean(v)) for f,v in perm.items()}
# Importance table
imp = pd.DataFrame({
"feature": list(X.columns),
"lofo_delta_cv_r2": [lofo[f] for f in X.columns],
"perm_drop_in_test_r2": [perm_mean[f] for f in X.columns],
"mean_abs_std_coef": [coef_mean[f] for f in X.columns],
})
imp["rank_lofo"] = imp["lofo_delta_cv_r2"].rank(ascending=False, method="min")
imp["rank_perm"] = imp["perm_drop_in_test_r2"].rank(ascending=False, method="min")
imp["rank_coef"] = imp["mean_abs_std_coef"].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[["rank_lofo","rank_perm","rank_coef"]].mean(axis=1)
imp = imp.sort_values("aggregate_rank").reset_index(drop=True)
# Label for action
imp["action"] = np.where(imp["lofo_delta_cv_r2"] < 0,
"candidate_to_drop",
"keep_or_review")
print(f"Rows used: {len(X)} | Features used: {X.shape[1]}")
print(f"Baseline CV R^2 (all features): {base_cv_r2:.4f}")
display(imp.head(12))
# Save
out_path = find_file(EVENT_FILE).parent / "baseline_v1_feature_importance_window_0_5.csv"
imp.to_csv(out_path, index=False)
print("Saved:", out_path)
# Quick keep/drop shortlists
keep = imp.sort_values(["lofo_delta_cv_r2","perm_drop_in_test_r2"], ascending=False).head(10)[["feature","lofo_delta_cv_r2","perm_drop_in_test_r2"]]
drop = imp.sort_values(["lofo_delta_cv_r2","perm_drop_in_test_r2"], ascending=[True, True]).head(10)[["feature","lofo_delta_cv_r2","perm_drop_in_test_r2"]]
print("\nTop KEEP candidates:")
display(keep)
print("\nTop DROP candidates:")
display(drop)
Rows used: 129 | Features used: 16
Baseline CV R^2 (all features): -0.0331
| feature | lofo_delta_cv_r2 | perm_drop_in_test_r2 | mean_abs_std_coef | rank_lofo | rank_perm | rank_coef | aggregate_rank | action | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | pre_ret_3d | 0.108232 | 0.173534 | 0.019858 | 1.0 | 1.0 | 2.0 | 1.333333 | keep_or_review |
| 1 | eps_surprise_pct | 0.107472 | 0.143380 | 0.018775 | 2.0 | 4.0 | 3.0 | 3.000000 | keep_or_review |
| 2 | vix_level_lag1 | 0.035916 | 0.150289 | 0.018590 | 3.0 | 3.0 | 4.0 | 3.333333 | keep_or_review |
| 3 | macro_us10y | 0.019432 | 0.136189 | 0.022850 | 4.0 | 5.0 | 1.0 | 3.333333 | keep_or_review |
| 4 | mkt_ret_5d_lag1 | 0.002695 | 0.129564 | 0.013140 | 6.0 | 6.0 | 8.0 | 6.666667 | keep_or_review |
| 5 | mkt_ret_10d_lag1 | -0.033044 | 0.153363 | 0.018457 | 13.0 | 2.0 | 5.0 | 6.666667 | candidate_to_drop |
| 6 | pre_ret_5d | -0.032872 | 0.101560 | 0.015816 | 12.0 | 7.0 | 6.0 | 8.333333 | candidate_to_drop |
| 7 | vix_chg_10d_lag1 | 0.006345 | 0.082645 | 0.010040 | 5.0 | 8.0 | 12.0 | 8.333333 | keep_or_review |
| 8 | vix_chg_5d_lag1 | -0.005810 | 0.058900 | 0.010042 | 9.0 | 9.0 | 11.0 | 9.666667 | candidate_to_drop |
| 9 | macro_cpi_yoy | -0.023060 | 0.046462 | 0.011607 | 10.0 | 11.0 | 9.0 | 10.000000 | candidate_to_drop |
| 10 | pre_vol_10d | -0.034472 | 0.058481 | 0.010715 | 14.0 | 10.0 | 10.0 | 11.333333 | candidate_to_drop |
| 11 | macro_fedfunds | -0.000884 | 0.018450 | 0.009147 | 8.0 | 13.0 | 13.0 | 11.333333 | candidate_to_drop |
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\baseline_v1_feature_importance_window_0_5.csv
Top KEEP candidates:
| feature | lofo_delta_cv_r2 | perm_drop_in_test_r2 | |
|---|---|---|---|
| 0 | pre_ret_3d | 0.108232 | 0.173534 |
| 1 | eps_surprise_pct | 0.107472 | 0.143380 |
| 2 | vix_level_lag1 | 0.035916 | 0.150289 |
| 3 | macro_us10y | 0.019432 | 0.136189 |
| 7 | vix_chg_10d_lag1 | 0.006345 | 0.082645 |
| 4 | mkt_ret_5d_lag1 | 0.002695 | 0.129564 |
| 14 | pre_vol_5d | -0.000318 | -0.012311 |
| 11 | macro_fedfunds | -0.000884 | 0.018450 |
| 8 | vix_chg_5d_lag1 | -0.005810 | 0.058900 |
| 9 | macro_cpi_yoy | -0.023060 | 0.046462 |
Top DROP candidates:
| feature | lofo_delta_cv_r2 | perm_drop_in_test_r2 | |
|---|---|---|---|
| 12 | pre_ret_10d | -0.045927 | 0.036502 |
| 15 | mkt_ret_1d_lag1 | -0.040140 | 0.009476 |
| 10 | pre_vol_10d | -0.034472 | 0.058481 |
| 5 | mkt_ret_10d_lag1 | -0.033044 | 0.153363 |
| 6 | pre_ret_5d | -0.032872 | 0.101560 |
| 13 | pre_vol_3d | -0.030640 | 0.015875 |
| 9 | macro_cpi_yoy | -0.023060 | 0.046462 |
| 8 | vix_chg_5d_lag1 | -0.005810 | 0.058900 |
| 11 | macro_fedfunds | -0.000884 | 0.018450 |
| 14 | pre_vol_5d | -0.000318 | -0.012311 |
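The winsorise-and-standardise step in the cell above (`fit_transformers` / `apply_transformers`) fits its percentiles and moments on the training fold only, then applies them unchanged to the test fold, so extreme test values cannot leak into the scaling. A toy sketch of the same pattern (the numbers are illustrative, not from the data):

```python
import numpy as np
import pandas as pd

def fit_stats(train: pd.Series, lo=1, hi=99):
    # Percentiles, mean and sd are estimated on the TRAIN fold only.
    lo_v, hi_v = np.nanpercentile(train.values, [lo, hi])
    clamped = train.clip(lo_v, hi_v)
    mu = float(np.nanmean(clamped))
    sd = float(np.nanstd(clamped, ddof=0)) or 1.0
    return lo_v, hi_v, mu, sd

def transform(s: pd.Series, stats):
    lo_v, hi_v, mu, sd = stats
    # Test rows are clamped to TRAIN percentiles and standardised with
    # TRAIN moments -- an extreme test value cannot shift the scaling.
    return (s.clip(lo_v, hi_v) - mu) / sd

train = pd.Series([0.1, 0.2, 0.3, 0.4, 5.0])   # 5.0 is a train outlier
test = pd.Series([0.25, 100.0])                # 100.0 gets clamped
stats = fit_stats(train)
z = transform(test, stats)
print(z.round(3).tolist())  # → [-0.5, 1.997]
```

The clamped test outlier lands near the train-side upper percentile instead of dominating the standardised scale.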
In [8]:
import os
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
import requests
from dotenv import load_dotenv
from pandas_datareader import data as web
import yfinance as yf
# ================== CONFIG & ENV ================== #
load_dotenv()
FRED_API_KEY = os.getenv("FRED_API_KEY")
ALPHAVANTAGE_API_KEY = os.getenv("ALPHAVANTAGE_API_KEY")
# Tickers for the strategy
TICKERS: List[str] = ["AAPL", "NVDA", "GOOGL"]
# Backtest date range
BACKTEST_START = "2000-01-01"
BACKTEST_END: Optional[str] = None # None = today
# Event study settings
# IMPORTANT: 0..5 means 6 daily returns: r0..r5, i.e. from C[-1] -> C[5]
EVENT_WINDOW = (0, 5) # returns at day0..day5
ESTIMATION_LOOKBACK = 120 # -120..-20 trading days
ESTIMATION_GAP = 20 # gap from day0 back to end of estimation
WINSOR_P = 0.01 # 1% tails for returns
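# As the comment above notes, the (0, 5) window spans six daily returns
# r0..r5, equivalent to the price move from the day -1 close to the
# day 5 close. A quick numeric check of that identity with toy prices:

```python
import numpy as np

# Closes from day -1 through day 5 (7 prices -> 6 daily returns).
closes = np.array([100.0, 102.0, 101.0, 103.0, 104.0, 103.5, 106.0])
rets = closes[1:] / closes[:-1] - 1.0   # r0..r5

# Compounding the six returns recovers C[5] / C[-1] - 1.
compounded = np.prod(1.0 + rets) - 1.0
direct = closes[-1] / closes[0] - 1.0
print(np.isclose(compounded, direct))  # True
```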
# ================== GENERIC HELPERS ================== #
def get_date_range() -> Tuple[str, str]:
start_dt = pd.to_datetime(BACKTEST_START)
if BACKTEST_END is None:
end_dt = pd.Timestamp.today().normalize()
else:
end_dt = pd.to_datetime(BACKTEST_END).normalize()
return start_dt.strftime("%Y-%m-%d"), end_dt.strftime("%Y-%m-%d")
def winsorize_series(s: pd.Series, p: float) -> pd.Series:
if s.empty:
return s
lower = s.quantile(p)
upper = s.quantile(1.0 - p)
return s.clip(lower, upper)
# ================== PRICES VIA YFINANCE + LOCAL CSV ================== #
def fetch_prices_yf(symbol: str, start: str, end: str) -> Optional[pd.DataFrame]:
"""
Pull daily OHLCV from Yahoo via yfinance.
"""
try:
df = yf.download(symbol, start=start, end=end, auto_adjust=False, progress=False)
except Exception as e:
print(f"{symbol}: yfinance price download failed: {e}")
return None
if df is None or df.empty:
print(f"{symbol}: yfinance returned no price data.")
return None
df = df.copy()
df.index = pd.to_datetime(df.index)
df.index.name = "date"
df = df.sort_index()
df = df.rename(
columns={
"Open": "open",
"High": "high",
"Low": "low",
"Close": "close",
"Adj Close": "adj_close",
"Volume": "volume",
}
)
for col in ["open", "high", "low", "close", "adj_close", "volume"]:
if col not in df.columns:
df[col] = np.nan
return df[["open", "high", "low", "close", "adj_close", "volume"]]
def get_prices_with_fallback(symbol: str, start: str, end: str) -> Optional[pd.DataFrame]:
"""
Try prices in this order:
1) yfinance (Yahoo)
2) Local CSV fallback: {SYMBOL}.csv
3) Legacy fallback: {SYMBOL}_from_prices_clean.csv
"""
# 1) Try online from yfinance
df = fetch_prices_yf(symbol, start, end)
if df is not None and not df.empty:
print(f"{symbol}: got {len(df)} daily rows from yfinance")
return df
# 2–3) Try local files
candidate_files = [
f"{symbol}.csv",
f"{symbol}_from_prices_clean.csv",
]
for csv_path in candidate_files:
if csv_path and os.path.exists(csv_path):
print(f"{symbol}: using local price file {csv_path} as fallback.")
df_local = pd.read_csv(csv_path)
if "date" in df_local.columns:
df_local["date"] = pd.to_datetime(df_local["date"])
df_local = df_local.sort_values("date").set_index("date")
else:
df_local.index = pd.to_datetime(df_local.index)
df_local = df_local.sort_index()
# ensure adj_close exists
if "adj_close" not in df_local.columns:
if "close" in df_local.columns:
df_local["adj_close"] = df_local["close"]
else:
df_local["adj_close"] = np.nan
# filter to requested date range
mask = (df_local.index >= pd.to_datetime(start)) & (df_local.index <= pd.to_datetime(end))
df_local = df_local.loc[mask]
for col in ["open", "high", "low", "close", "adj_close", "volume"]:
if col not in df_local.columns:
df_local[col] = np.nan
df_local.index.name = "date"
return df_local[["open", "high", "low", "close", "adj_close", "volume"]]
print(f"{symbol}: FAILED to get prices from yfinance and no local CSV found.")
return None
def download_all_prices(start: str, end: str) -> Dict[str, pd.DataFrame]:
"""
Download prices for all tickers and save per-ticker CSVs:
AAPL.csv, NVDA.csv, GOOGL.csv
Each file has columns:
date, open, high, low, close, adj_close, volume
Also returns a dict of per-ticker DataFrames indexed by date.
"""
px_raw: Dict[str, pd.DataFrame] = {}
for sym in TICKERS:
df = get_prices_with_fallback(sym, start, end)
if df is None or df.empty:
print(f"{sym}: no price data available at all.")
continue
df = df.copy()
# make sure we have a DatetimeIndex named 'date'
if not isinstance(df.index, pd.DatetimeIndex):
if "date" in df.columns:
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
else:
df.index = pd.to_datetime(df.index)
df.index.name = "date"
df = df.sort_index()
# store in memory for the rest of the pipeline
px_raw[sym] = df[["open", "high", "low", "close", "adj_close", "volume"]]
# write per-ticker CSV with a date column, not index
out = df.reset_index()
out_path = f"{sym}.csv"
try:
out.to_csv(out_path, index=False)
print(f"Saved {out_path} with {len(out)} rows.")
except PermissionError:
alt = f"{sym}_new.csv"
out.to_csv(alt, index=False)
print(
f"Could not overwrite {out_path} (maybe open in Excel). "
f"Saved prices to {alt} instead."
)
if not px_raw:
print("No prices downloaded – no per-ticker CSVs written.")
return px_raw
# ================== FAMA–FRENCH 3 FACTORS (DAILY, FLAT) ================== #
def fetch_ff_factors(start: str, end: str) -> pd.DataFrame:
"""
Fetch daily Fama–French 3 factors, flatten to a DataFrame with a 'date' column,
and write ff_factors_daily.csv.
"""
print("Fetching Fama–French 3 factors (daily)...")
start_dt = pd.to_datetime(start)
end_dt = pd.to_datetime(end)
ff3 = web.DataReader("F-F_Research_Data_Factors_Daily", "famafrench", start_dt)[0]
ff3 = ff3.copy()
ff3.index = pd.to_datetime(ff3.index)
ff3 = ff3[(ff3.index >= start_dt) & (ff3.index <= end_dt)]
df = ff3.rename(
columns={
"Mkt-RF": "Mkt_RF",
"SMB": "SMB",
"HML": "HML",
"RF": "RF",
}
)
df = df.reset_index()
date_col = df.columns[0]
df = df.rename(columns={date_col: "date"})
df["date"] = pd.to_datetime(df["date"])
for col in ["Mkt_RF", "SMB", "HML", "RF"]:
df[col] = df[col] / 100.0
df_out = df[["date", "Mkt_RF", "SMB", "HML", "RF"]].copy()
df_out.to_csv("ff_factors_daily.csv", index=False)
print("Saved ff_factors_daily.csv")
return df_out
# ================== EARNINGS: ALPHA VANTAGE ONLY ================== #
def fetch_earnings_alpha_vantage(symbol: str, start_dt: pd.Timestamp, end_dt: pd.Timestamp) -> pd.DataFrame:
"""
Reported EPS and estimate from Alpha Vantage EARNINGS endpoint.
"""
if not ALPHAVANTAGE_API_KEY:
print("ALPHAVANTAGE_API_KEY not set – no EPS.")
return pd.DataFrame()
url = "https://www.alphavantage.co/query"
params = {
"function": "EARNINGS",
"symbol": symbol,
"apikey": ALPHAVANTAGE_API_KEY,
}
try:
r = requests.get(url, params=params, timeout=20)
r.raise_for_status()
data = r.json()
except Exception as e:
print(f"{symbol}: Alpha Vantage earnings failed: {e}")
return pd.DataFrame()
q = data.get("quarterlyEarnings", [])
if not q:
print(f"{symbol}: Alpha Vantage returned no quarterlyEarnings.")
return pd.DataFrame()
rows = []
for item in q:
d_str = item.get("reportedDate") or item.get("fiscalDateEnding")
if not d_str:
continue
ad = pd.to_datetime(d_str).normalize()
if ad < start_dt or ad > end_dt:
continue
rep = item.get("reportedEPS")
est = item.get("estimatedEPS")
if rep is None or est is None:
continue
try:
eps_actual = float(rep)
eps_est_val = float(est)
except Exception:
continue
rows.append(
{
"ticker": symbol,
"announce_date": ad,
"eps_actual": eps_actual,
"eps_est": eps_est_val,
}
)
if not rows:
print(f"{symbol}: Alpha Vantage had no usable EPS rows.")
return pd.DataFrame()
df = pd.DataFrame(rows)
df["ticker"] = symbol
return df
def combine_all_eps_sources(start_dt: pd.Timestamp, end_dt: pd.Timestamp) -> pd.DataFrame:
"""
Try to get EPS online first.
If no online EPS is found for ANY ticker, fall back to local eps_master.csv.
ALWAYS:
- return a DataFrame with columns:
ticker, announce_date, eps_actual, eps_est, n_sources
- write a working copy to eventearnings.csv
eps_master.csv is treated as your "master" backup:
- If it already exists, we NEVER overwrite it.
- If it does NOT exist and we DO have online data, we create it once.
"""
all_rows: List[pd.DataFrame] = []
backup_path = "eps_master.csv"
# ---------- 1) TRY ONLINE EPS (Alpha Vantage) ----------
for sym in TICKERS:
print(f"\nFetching EPS for {sym} from Alpha Vantage...")
av_df = fetch_earnings_alpha_vantage(sym, start_dt, end_dt)
if av_df is not None and not av_df.empty:
all_rows.append(av_df)
# ---------- 2) NO ONLINE DATA → FALL BACK TO eps_master.csv ----------
if not all_rows:
if os.path.exists(backup_path):
print("No EPS from Alpha Vantage – using local eps_master.csv backup.")
backup = pd.read_csv(backup_path)
if backup.empty:
print("Backup eps_master.csv is empty.")
cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
empty = pd.DataFrame(columns=cols)
empty.to_csv("eventearnings.csv", index=False)
print("Saved empty eventearnings.csv.")
return empty
needed_cols = {"ticker", "announce_date", "eps_actual", "eps_est"}
missing = needed_cols.difference(backup.columns)
if missing:
print(f"Backup eps_master.csv is missing columns {missing}.")
cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
empty = pd.DataFrame(columns=cols)
empty.to_csv("eventearnings.csv", index=False)
print("Saved empty eventearnings.csv.")
return empty
# Clean and filter to backtest range
backup = backup.copy()
backup["ticker"] = backup["ticker"].astype(str).str.upper()
backup["announce_date"] = pd.to_datetime(backup["announce_date"]).dt.normalize()
mask = (backup["announce_date"] >= start_dt) & (backup["announce_date"] <= end_dt)
master = backup.loc[mask].copy()
if "n_sources" not in master.columns:
master["n_sources"] = 1
master = master.sort_values(["ticker", "announce_date"]).reset_index(drop=True)
# IMPORTANT: only write WORKING COPY
master.to_csv("eventearnings.csv", index=False)
print(f"Using {len(master)} EPS rows from local backup. Saved eventearnings.csv.")
return master
# No online data and no backup file
print("No EPS from Alpha Vantage and no eps_master.csv backup – creating empty tables.")
cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
empty = pd.DataFrame(columns=cols)
empty.to_csv("eventearnings.csv", index=False)
print("Saved empty eventearnings.csv.")
return empty
# ---------- 3) WE HAVE ONLINE DATA → BUILD MASTER FROM IT ----------
eps_all = pd.concat(all_rows, ignore_index=True)
eps_all["ticker"] = eps_all["ticker"].astype(str).str.upper()
eps_all["announce_date"] = pd.to_datetime(eps_all["announce_date"]).dt.normalize()
eps_all = eps_all.sort_values(["ticker", "announce_date"])
eps_all = eps_all.drop_duplicates(subset=["ticker", "announce_date"], keep="last")
if "n_sources" not in eps_all.columns:
eps_all["n_sources"] = 1
master = eps_all[["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]].copy()
master = master.sort_values(["ticker", "announce_date"]).reset_index(drop=True)
# ALWAYS write the events file as a COPY of eps_master
master.to_csv("eventearnings.csv", index=False)
print(f"Saved eventearnings.csv with {len(master)} rows from Alpha Vantage.")
# Only create eps_master.csv automatically if it does NOT exist yet
if not os.path.exists(backup_path):
master.to_csv(backup_path, index=False)
print("No existing eps_master.csv found, so saved a new master from online data.")
else:
print("Existing eps_master.csv detected – leaving it untouched.")
return master
# ================== FEATURES TABLE (eps_surprise_pct, pre_ret_3d) ================== #
def build_features_table(
eps_master: pd.DataFrame, px_raw: Dict[str, pd.DataFrame]
) -> pd.DataFrame:
"""
Build features_model.csv with:
- eps_surprise_pct = (eps_actual - eps_est) / |eps_est|
- pre_ret_3d = Price(D-1) / Price(D-4) - 1
Day0 = first trading day AFTER announce_date (AMC).
"""
records: List[dict] = []
if eps_master.empty:
features_df = pd.DataFrame(
columns=[
"ticker",
"announce_date",
"day0",
"eps_actual",
"eps_est",
"eps_surprise_pct",
"pre_ret_3d",
"n_sources",
]
)
try:
features_df.to_csv("features_model.csv", index=False)
print("Saved empty features_model.csv (no EPS events).")
except PermissionError:
alt = "features_model_new.csv"
features_df.to_csv(alt, index=False)
print(
f"Could not overwrite features_model.csv (maybe open in Excel). "
f"Saved empty features to {alt} instead."
)
return features_df
eps_df = eps_master.copy()
eps_df["ticker"] = eps_df["ticker"].astype(str).str.upper()
eps_df["announce_date"] = pd.to_datetime(eps_df["announce_date"]).dt.normalize()
for _, row in eps_df.iterrows():
sym = row["ticker"]
px = px_raw.get(sym)
if px is None or px.empty:
continue
idx = px.index
announce_date = pd.to_datetime(row["announce_date"]).normalize()
future_dates = idx[idx > announce_date]
if len(future_dates) == 0:
continue
day0 = future_dates[0]
loc0 = idx.get_loc(day0)
if loc0 < 4:
continue
loc_minus1 = loc0 - 1
loc_minus4 = loc0 - 4
price_minus1 = float(px["adj_close"].iloc[loc_minus1])
price_minus4 = float(px["adj_close"].iloc[loc_minus4])
if price_minus4 == 0.0:
continue
pre_ret_3d = price_minus1 / price_minus4 - 1.0
eps_actual = float(row["eps_actual"])
eps_est = float(row["eps_est"])
if eps_est == 0:
continue
eps_surprise_pct = (eps_actual - eps_est) / abs(eps_est)
records.append(
{
"ticker": sym,
"announce_date": announce_date,
"day0": day0,
"eps_actual": eps_actual,
"eps_est": eps_est,
"eps_surprise_pct": eps_surprise_pct,
"pre_ret_3d": pre_ret_3d,
"n_sources": int(row.get("n_sources", 1)),
}
)
if not records:
features_df = pd.DataFrame(
columns=[
"ticker",
"announce_date",
"day0",
"eps_actual",
"eps_est",
"eps_surprise_pct",
"pre_ret_3d",
"n_sources",
]
)
else:
features_df = pd.DataFrame(records)
features_df = features_df.sort_values(["ticker", "day0"]).reset_index(drop=True)
try:
features_df.to_csv("features_model.csv", index=False)
print(f"Saved features_model.csv with {len(features_df)} rows.")
except PermissionError:
alt = "features_model_new.csv"
features_df.to_csv(alt, index=False)
print(
f"Could not overwrite features_model.csv (maybe open in Excel). "
f"Saved features to {alt} instead."
)
return features_df
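# build_features_table divides the EPS surprise by the absolute value of
# the estimate, so a beat against a negative estimate still registers as a
# positive surprise. A toy check of the formula (values are made up):

```python
# eps_surprise_pct = (eps_actual - eps_est) / |eps_est|
def eps_surprise_pct(actual: float, est: float) -> float:
    if est == 0:
        raise ValueError("zero estimates are skipped upstream")
    return (actual - est) / abs(est)

print(round(eps_surprise_pct(1.10, 1.00), 4))   # beat by 10% → 0.1
print(round(eps_surprise_pct(-0.50, -1.00), 4)) # smaller loss than feared → 0.5
```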
# ================== EVENT STUDY: CAR(0,5) WITH CORRECT 0–5 LOGIC ================== #
def build_event_study(
features_df: pd.DataFrame, px_raw: Dict[str, pd.DataFrame], ff_factors: pd.DataFrame
) -> pd.DataFrame:
"""
For each event (ticker, day0), compute CAR over 0..5 using FF3:
- Daily returns are:
r_t = AdjClose_t / AdjClose_{t-1} - 1
- Event window CAR(0,5) sums AR_0..AR_5:
6 daily abnormal returns (day0..day5),
which correspond to price move from day-1 close to day5 close.
- Estimation window: -120..-20 relative to day0
- Model: (ret - RF) ~ 1 + Mkt_RF + SMB + HML
- ret = winsorised daily return from adj_close
"""
if features_df.empty:
event_df = pd.DataFrame(
columns=[
"ticker",
"announce_date",
"day0",
"event_start",
"event_end",
"est_start",
"est_end",
"CAR_0_5",
]
)
event_df.to_csv("event_study_car_0_5.csv", index=False)
print("Saved empty event_study_car_0_5.csv (no features).")
return event_df
ff = ff_factors.copy()
ff["date"] = pd.to_datetime(ff["date"])
ff = ff.sort_values("date").set_index("date")
records: List[dict] = []
for sym in TICKERS:
px = px_raw.get(sym)
if px is None or px.empty:
continue
df_px = px.copy()
if not isinstance(df_px.index, pd.DatetimeIndex):
if "date" in df_px.columns:
df_px["date"] = pd.to_datetime(df_px["date"])
df_px = df_px.set_index("date")
else:
df_px.index = pd.to_datetime(df_px.index)
df_px = df_px.sort_index()
common_dates = df_px.index.intersection(ff.index)
common_dates = common_dates.sort_values()
if len(common_dates) == 0:
continue
merged = pd.DataFrame(index=common_dates)
merged["adj_close"] = df_px.loc[common_dates, "adj_close"].values
merged["Mkt_RF"] = ff.loc[common_dates, "Mkt_RF"].values
merged["SMB"] = ff.loc[common_dates, "SMB"].values
merged["HML"] = ff.loc[common_dates, "HML"].values
merged["RF"] = ff.loc[common_dates, "RF"].values
merged["date"] = merged.index
# daily returns (ret_t is associated with that day's close vs previous close)
merged["ret_raw"] = merged["adj_close"].pct_change()
merged["ret"] = winsorize_series(merged["ret_raw"], WINSOR_P)
idx = merged.index
ev_rows = features_df[features_df["ticker"] == sym]
if ev_rows.empty:
continue
needed = ["ret", "Mkt_RF", "SMB", "HML", "RF"]
for _, ev in ev_rows.iterrows():
day0 = pd.to_datetime(ev["day0"])
loc_candidates = np.where(idx == np.datetime64(day0))[0]
if len(loc_candidates) == 0:
continue
loc0 = int(loc_candidates[0])
# -------- EVENT WINDOW: 0..5 --------
# we need valid returns at indices loc0..loc0+5
# ret at index 0 is NaN because there is no previous day
event_start_loc = loc0 + EVENT_WINDOW[0] # should be loc0
event_end_loc = loc0 + EVENT_WINDOW[1] # loc0+5
if event_start_loc < 1: # need a previous day for ret at day0
continue
if event_end_loc >= len(merged):
continue
# -------- ESTIMATION WINDOW: -120..-20 --------
est_end = loc0 - ESTIMATION_GAP
est_start = est_end - ESTIMATION_LOOKBACK + 1
if est_start < 1 or est_end >= len(merged):
continue
est = merged.iloc[est_start: est_end + 1].copy()
if est[needed].isna().any().any():
continue
# Fit FF3 model on estimation window
y = est["ret"] - est["RF"]
X = np.column_stack(
[
np.ones(len(est)),
est["Mkt_RF"],
est["SMB"],
est["HML"],
]
)
beta_hat, *_ = np.linalg.lstsq(X, y.values, rcond=None)
# Event window rows: day0..day5
ev_df = merged.iloc[event_start_loc: event_end_loc + 1].copy()
if ev_df[needed].isna().any().any():
continue
X_ev = np.column_stack(
[
np.ones(len(ev_df)),
ev_df["Mkt_RF"],
ev_df["SMB"],
ev_df["HML"],
]
)
excess_hat = X_ev @ beta_hat
exp_ret = ev_df["RF"].values + excess_hat
# abnormal returns for days 0..5
abn = ev_df["ret"].values - exp_ret
car_0_5 = float(abn.sum())
records.append(
{
"ticker": sym,
"announce_date": ev["announce_date"],
"day0": day0,
# these are the dates for day0 .. day5
"event_start": ev_df["date"].iloc[0],
"event_end": ev_df["date"].iloc[-1],
# estimation window dates (for debugging / trust)
"est_start": est["date"].iloc[0],
"est_end": est["date"].iloc[-1],
"CAR_0_5": car_0_5,
}
)
if not records:
event_df = pd.DataFrame(
columns=[
"ticker",
"announce_date",
"day0",
"event_start",
"event_end",
"est_start",
"est_end",
"CAR_0_5",
]
)
else:
event_df = pd.DataFrame(records)
event_df = event_df.sort_values(["ticker", "day0"]).reset_index(drop=True)
event_df.to_csv("event_study_car_0_5.csv", index=False)
print(f"Saved event_study_car_0_5.csv with {len(event_df)} rows.")
return event_df
# ================== MACRO CALENDAR (USE YOUR CLEAN CSV IF PRESENT) ================== #
def fetch_macro_calendar(start: str, end: str) -> pd.DataFrame:
"""
Macro calendar logic:
1) If macro_calendar_clean.csv exists:
- Read it.
- Parse dates with dayfirst=True.
- Filter to [start, end].
- Re-save as macro_calendar.csv with:
date in dd/mm/YYYY
event_type (CPI, FOMC, etc).
2) If macro_calendar_clean.csv does not exist:
- Try to build a CPI-only calendar from FRED.
- Label all rows as event_type = "CPI".
- Save as macro_calendar.csv.
No fabricated "FOMC every day" placeholder rows.
"""
start_dt = pd.to_datetime(start)
end_dt = pd.to_datetime(end)
clean_path = "macro_calendar_clean.csv"
if os.path.exists(clean_path):
print("Using local macro_calendar_clean.csv as macro source.")
df = pd.read_csv(clean_path)
if "date" not in df.columns or "event_type" not in df.columns:
print("macro_calendar_clean.csv is missing 'date' or 'event_type' columns.")
df_out = pd.DataFrame(columns=["date", "event_type"])
df_out.to_csv("macro_calendar.csv", index=False)
print("Saved empty macro_calendar.csv")
return df_out
df["date"] = pd.to_datetime(df["date"], dayfirst=True, errors="coerce")
df = df.dropna(subset=["date"])
df = df[(df["date"] >= start_dt) & (df["date"] <= end_dt)]
if df.empty:
df_out = pd.DataFrame(columns=["date", "event_type"])
else:
df_out = df[["date", "event_type"]].copy()
df_out["date"] = df_out["date"].dt.strftime("%d/%m/%Y")
try:
df_out.to_csv("macro_calendar.csv", index=False)
print(f"Saved macro_calendar.csv with {len(df_out)} rows (from macro_calendar_clean.csv).")
except PermissionError:
alt = "macro_calendar_new.csv"
df_out.to_csv(alt, index=False)
print(
f"Could not overwrite macro_calendar.csv (maybe open in Excel). "
f"Saved macro calendar to {alt} instead."
)
return df_out
# Fallback: CPI-only from FRED (no FOMC)
if not FRED_API_KEY:
df_empty = pd.DataFrame(columns=["date", "event_type"])
df_empty.to_csv("macro_calendar.csv", index=False)
print("No macro_calendar_clean.csv and no FRED_API_KEY – saved empty macro_calendar.csv")
return df_empty
print("macro_calendar_clean.csv not found – building CPI-only calendar from FRED.")
base = "https://api.stlouisfed.org/fred"
common_params = {"api_key": FRED_API_KEY, "file_type": "json"}
start_str = start_dt.strftime("%Y-%m-%d")
end_str = end_dt.strftime("%Y-%m-%d")
try:
r = requests.get(base + "/releases", params=common_params, timeout=20)
r.raise_for_status()
rel_data = r.json()
releases = rel_data.get("releases", [])
except Exception as e:
print(f"Error fetching FRED releases: {e}")
df_empty = pd.DataFrame(columns=["date", "event_type"])
df_empty.to_csv("macro_calendar.csv", index=False)
return df_empty
cpi_release_ids: List[int] = []
for rel in releases:
rid = rel.get("id")
name = rel.get("name", "")
if rid is None:
continue
nl = name.lower()
if "consumer price index" in nl:
cpi_release_ids.append(rid)
records: List[dict] = []
for rid in cpi_release_ids:
params = {
"api_key": FRED_API_KEY,
"file_type": "json",
"release_id": rid,
"observation_start": start_str,
"observation_end": end_str,
}
try:
r2 = requests.get(base + "/release/dates", params=params, timeout=20)
r2.raise_for_status()
d2 = r2.json()
for item in d2.get("release_dates", []):
d_str = item.get("date")
if not d_str:
continue
ts = pd.to_datetime(d_str).normalize()
records.append({"date": ts, "event_type": "CPI"})
except Exception as e:
print(f"Error fetching FRED dates for release {rid}: {e}")
if not records:
df = pd.DataFrame(columns=["date", "event_type"])
else:
df = pd.DataFrame(records)
df = df.drop_duplicates().sort_values("date").reset_index(drop=True)
if not df.empty:
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%d/%m/%Y")
try:
df.to_csv("macro_calendar.csv", index=False)
print(f"Saved macro_calendar.csv with {len(df)} rows (CPI-only FRED fallback).")
except PermissionError:
alt = "macro_calendar_new.csv"
df.to_csv(alt, index=False)
print(
f"Could not overwrite macro_calendar.csv (maybe open in Excel). "
f"Saved macro calendar to {alt} instead."
)
return df
# ================== MAIN: PIPELINE ONLY (NO STRATEGY) ================== #
def main() -> None:
start_str, end_str = get_date_range()
print(f"Date range: {start_str} to {end_str}")
print("\n--- Step 1: Download daily OHLCV prices ---")
px_raw = download_all_prices(start_str, end_str)
print("\n--- Step 2: Download Fama–French factors ---")
ff_factors = fetch_ff_factors(start_str, end_str)
print("\n--- Step 3: Download EPS from Alpha Vantage ---")
start_dt = pd.to_datetime(start_str)
end_dt = pd.to_datetime(end_str)
eps_master = combine_all_eps_sources(start_dt, end_dt)
print("\n--- Step 4: Build features table (eps_surprise_pct, pre_ret_3d) ---")
features_df = build_features_table(eps_master, px_raw)
print("\n--- Step 5: Build event study with CAR(0,5) ---")
event_df = build_event_study(features_df, px_raw, ff_factors)
print("\n--- Step 6: Build macro calendar (from macro_calendar_clean.csv if present) ---")
macro_df = fetch_macro_calendar(start_str, end_str)
print("\n--- Done (data pipeline only, no strategy yet) ---")
print(f"Prices rows (all tickers): {sum(len(df) for df in px_raw.values())}")
print(f"FF factors rows: {len(ff_factors)}")
print(f"EPS master rows: {len(eps_master)}")
print(f"Features rows: {len(features_df)}")
print(f"Event study rows: {len(event_df)}")
print(f"Macro calendar rows: {len(macro_df)}")
if __name__ == "__main__":
main()
Date range: 2000-01-01 to 2025-11-20

--- Step 1: Download daily OHLCV prices ---
AAPL: got 6511 daily rows from yfinance
Saved AAPL.csv with 6511 rows.
NVDA: got 6511 daily rows from yfinance
Saved NVDA.csv with 6511 rows.
GOOGL: got 5349 daily rows from yfinance
Saved GOOGL.csv with 5349 rows.

--- Step 2: Download Fama–French factors ---
Fetching Fama–French 3 factors (daily)...
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:208: FutureWarning: The argument 'date_parser' is deprecated and will be removed in a future version. Please use 'date_format' instead, or read your data in as 'object' dtype and then call 'to_datetime'.
ff3 = web.DataReader("F-F_Research_Data_Factors_Daily", "famafrench", start_dt)[0]
Saved ff_factors_daily.csv

--- Step 3: Download EPS from Alpha Vantage ---
Fetching EPS for AAPL from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.
Fetching EPS for NVDA from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.
Fetching EPS for GOOGL from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.
No EPS from Alpha Vantage – using local eps_master.csv backup.
Using 282 EPS rows from local backup.
Saved eventearnings.csv.

--- Step 4: Build features table (eps_surprise_pct, pre_ret_3d) ---
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:488: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  price_minus1 = float(px["adj_close"].iloc[loc_minus1])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:489: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  price_minus4 = float(px["adj_close"].iloc[loc_minus4])
Saved features_model.csv with 282 rows.

--- Step 5: Build event study with CAR(0,5) ---
Saved event_study_car_0_5.csv with 273 rows.

--- Step 6: Build macro calendar (from macro_calendar_clean.csv if present) ---
macro_calendar_clean.csv not found – building CPI-only calendar from FRED.
Saved macro_calendar.csv with 940 rows (CPI-only FRED fallback).

--- Done (data pipeline only, no strategy yet) ---
Prices rows (all tickers): 18371
FF factors rows: 6475
EPS master rows: 282
Features rows: 282
Event study rows: 273
Macro calendar rows: 940
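The core of the event-study step above is: fit an FF3 regression by least squares over the estimation window, then subtract the model-implied return over days 0..5 and sum the residuals into CAR(0,5). A self-contained sketch of that calculation on synthetic data (all numbers here are made up for illustration, not from the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic estimation window: 100 days of factor returns and stock excess returns
n = 100
factors = rng.normal(0.0, 0.01, size=(n, 3))        # stand-ins for Mkt_RF, SMB, HML
true_beta = np.array([0.0002, 1.1, 0.3, -0.2])      # alpha plus three factor loadings
X = np.column_stack([np.ones(n), factors])
excess = X @ true_beta + rng.normal(0.0, 0.001, n)

# OLS fit via lstsq on an intercept-augmented design, as the pipeline does
beta_hat, *_ = np.linalg.lstsq(X, excess, rcond=None)

# Event window day0..day5: abnormal return = actual minus model-implied return
X_ev = np.column_stack([np.ones(6), rng.normal(0.0, 0.01, size=(6, 3))])
actual = X_ev @ true_beta + rng.normal(0.0, 0.001, 6)
abnormal = actual - X_ev @ beta_hat
car_0_5 = float(abnormal.sum())
```

With no genuine event effect in the synthetic data, the recovered loadings sit close to the true ones and the CAR hovers near zero, which is the sanity check the estimation/event-window split is designed to give.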
In [1]:
import pandas as pd
import numpy as np
# ======================= SETTINGS ======================= #
# Core data produced by auto_pipeline_and_backtest.py
FEATURES_PATH = "features_model.csv" # eps_surprise_pct, pre_ret_3d, etc.
EVENT_STUDY_PATH = "event_study_car_0_5.csv" # CAR_0_5 per event
MACRO_PATH = "macro_calendar.csv" # macro dates to avoid (CPI / FOMC etc.)
PRICE_FILES = {
"AAPL": "AAPL.csv",
"NVDA": "NVDA.csv",
"GOOGL": "GOOGL.csv",
}
# Features we use in the regression
FEATURE_COLS = ["eps_surprise_pct", "pre_ret_3d"]
# Training + trading universe
TRAIN_START_DATE = pd.Timestamp("2010-01-01") # only train on events from here onwards
MIN_TRAIN_EVENTS = 80 # minimum training events before we trade
# Transaction cost per leg per side (0.0005 = 0.05%)
COST_RATE = 0.0005
# Capital we pretend the fund allocates to this strategy
ASSUMED_CAPITAL = 10_000_000
# Avoid trading on macro dates (day0 in macro_calendar)
USE_MACRO_FILTER = True
# ======================= HELPERS ======================= #
def score_to_allocation_dollars(score: float) -> float:
"""
Map the model score to a dollar position.
Positive score -> long notional. Negative score -> short notional.
Uses your preferred stepped ladder:
|score| < 0.3 -> 0
0.3 <= |score| < 0.4 -> 200k
0.4 <= |score| < 0.5 -> 400k
|score| >= 0.5 -> 600k
"""
s = abs(score)
if s < 0.3:
alloc = 0.0
elif s < 0.4:
alloc = 200_000.0
elif s < 0.5:
alloc = 400_000.0
else:
alloc = 600_000.0
return float(np.sign(score) * alloc)
def load_prices():
"""
Load AAPL, NVDA, GOOGL daily prices from CSVs produced by the pipeline.
Expected columns in each file:
ticker, date, open, high, low, close, adj_close, volume
"""
frames = []
for ticker, path in PRICE_FILES.items():
df = pd.read_csv(path)
if "ticker" not in df.columns:
df["ticker"] = ticker
df["date"] = pd.to_datetime(df["date"]).dt.normalize()
needed = ["ticker", "date", "open", "high", "low", "close", "adj_close", "volume"]
missing = [c for c in needed if c not in df.columns]
if missing:
raise ValueError(f"{ticker}: missing columns {missing} in {path}")
df = df[needed]
frames.append(df)
prices = pd.concat(frames, ignore_index=True)
prices = prices.sort_values(["ticker", "date"])
prices.set_index(["ticker", "date"], inplace=True)
return prices
def load_macro_dates():
"""
Load macro dates (CPI / FOMC etc.) and return a set of dates to avoid trading.
"""
try:
macro = pd.read_csv(MACRO_PATH)
except FileNotFoundError:
print("macro_calendar.csv not found – no macro filtering will be applied.")
return set()
if "date" not in macro.columns:
print("macro_calendar.csv has no 'date' column – no macro filtering will be applied.")
return set()
macro["date"] = pd.to_datetime(macro["date"], dayfirst=True, errors="coerce").dt.normalize()
unique_dates = set(macro["date"].dropna().unique())
print(f"Loaded {len(unique_dates)} unique macro dates to avoid (day0 only).")
return unique_dates
# ======================= MAIN BACKTEST ======================= #
def backtest_directional():
# 1) Load features and event study
features = pd.read_csv(FEATURES_PATH)
events = pd.read_csv(EVENT_STUDY_PATH)
# Ensure proper date types
for df in (features, events):
if "announce_date" in df.columns:
df["announce_date"] = pd.to_datetime(df["announce_date"]).dt.normalize()
if "day0" in df.columns:
df["day0"] = pd.to_datetime(df["day0"]).dt.normalize()
# Merge on ticker + dates
merge_keys = ["ticker", "announce_date", "day0"]
df = pd.merge(
events,
features,
on=merge_keys,
how="inner",
suffixes=("", "_feat"),
)
# Keep only the columns we care about
if "CAR_0_5" not in df.columns:
raise ValueError("Expected 'CAR_0_5' column in event_study_car_0_5.csv")
# Filter to training/trading universe (day0 >= TRAIN_START_DATE)
df = df[df["day0"] >= TRAIN_START_DATE].copy()
df = df.sort_values("day0").reset_index(drop=True)
n_events = len(df)
print(f"Total events in sample (day0 >= {TRAIN_START_DATE.date()}): {n_events}")
if n_events == 0:
print("No events after TRAIN_START_DATE – nothing to backtest.")
return None
# Drop rows with missing features
df = df.dropna(subset=FEATURE_COLS + ["CAR_0_5"]).reset_index(drop=True)
# 2) Load prices
prices = load_prices()
# 3) Macro dates
macro_dates = load_macro_dates() if USE_MACRO_FILTER else set()
records = []
# 4) Walk forward through time (event by event)
for i in range(len(df)):
row = df.iloc[i]
ticker = row["ticker"]
day0 = row["day0"]
announce_date = row["announce_date"]
car_0_5 = float(row["CAR_0_5"]) # factor-adjusted CAR(0,5)
x_i = row[FEATURE_COLS].values.astype(float)
score = np.nan
pos_dollars = 0.0
pnl_dollars = 0.0
raw_ret_0_5 = np.nan
exit_date = pd.NaT
skipped_macro = False
# Build training sample: all prior events (by day0) after TRAIN_START_DATE
train_mask = (df["day0"] < day0) & (df["day0"] >= TRAIN_START_DATE)
train = df[train_mask]
if len(train) >= MIN_TRAIN_EVENTS:
X_train = train[FEATURE_COLS].values.astype(float)
y_train = train["CAR_0_5"].values.astype(float)
# Linear regression with intercept
X_mat = np.column_stack([np.ones(len(X_train)), X_train])
beta_hat, *_ = np.linalg.lstsq(X_mat, y_train, rcond=None)
intercept = beta_hat[0]
coef = beta_hat[1:]
# Residual std and mean CAR on training sample
y_hat_train = X_mat @ beta_hat
resid = y_train - y_hat_train
sigma_resid = resid.std(ddof=1)
mean_car = y_train.mean()
# Prediction for this event
car_hat = intercept + np.dot(coef, x_i)
car_feat = car_hat - mean_car
score = car_feat / sigma_resid if sigma_resid > 0 else 0.0
# Decide dollar position
pos_dollars = score_to_allocation_dollars(score)
# Macro filter: avoid trading on macro dates (day0)
if USE_MACRO_FILTER and day0 in macro_dates:
skipped_macro = True
pos_dollars = 0.0
# If we actually take a position, compute PnL using daily prices
if pos_dollars != 0.0:
try:
px_tkr = prices.loc[ticker]
# Entry at day0 open
row0 = px_tkr.loc[day0]
open0 = float(row0["open"])
# Take up to 6 trading days from day0 (day0..day5)
px_window = px_tkr.loc[day0:].iloc[:6]
# Exit at last available in that 0..5 window
row_exit = px_window.iloc[-1]
exit_date = row_exit.name
close_exit = float(row_exit["adj_close"])
# NOTE: entry uses the raw open while exit uses adj_close; around dividend
# dates this mixes unadjusted and adjusted prices over the 6-day window.
raw_ret_0_5 = (close_exit - open0) / open0
# Trading costs (open + close)
trade_cost = 2.0 * COST_RATE * abs(pos_dollars)
# PnL in dollars
pnl_dollars = pos_dollars * raw_ret_0_5 - trade_cost
except KeyError:
# Missing price data -> no trade
pos_dollars = 0.0
pnl_dollars = 0.0
raw_ret_0_5 = np.nan
exit_date = pd.NaT
# Save record for this event
records.append({
"announce_date": announce_date,
"day0": day0,
"exit_date": exit_date,
"ticker": ticker,
"CAR_0_5": car_0_5,
"score": score,
"position_dollars": pos_dollars,
"raw_ret_0_5": raw_ret_0_5,
"pnl_dollars": pnl_dollars,
"skipped_macro": skipped_macro,
})
bt = pd.DataFrame(records)
# 5) Keep only actual trades
trades = bt[bt["position_dollars"] != 0].copy()
trades = trades.sort_values("day0").reset_index(drop=True)
out_path = "backtest_directional_trades.csv"
trades.to_csv(out_path, index=False)
print(f"\nSaved trade details to {out_path}")
n_trades = len(trades)
print(f"\nNumber of trades: {n_trades}")
if n_trades == 0:
print("No trades taken with current settings.")
return bt
# 6) Dollar-level stats
total_pnl = trades["pnl_dollars"].sum()
avg_pnl = trades["pnl_dollars"].mean()
med_pnl = trades["pnl_dollars"].median()
std_pnl = trades["pnl_dollars"].std(ddof=1)
hit_rate = (trades["pnl_dollars"] > 0).mean()
worst = trades["pnl_dollars"].min()
best = trades["pnl_dollars"].max()
print(f"Total PnL: ${total_pnl:,.2f}")
print(f"Average PnL per trade: ${avg_pnl:,.2f}")
print(f"Median PnL per trade: ${med_pnl:,.2f}")
print(f"Std dev PnL per trade: ${std_pnl:,.2f}")
print(f"Hit rate: {hit_rate:.3f}")
print(f"Worst trade: ${worst:,.2f}")
print(f"Best trade: ${best:,.2f}")
# Trades by size
trades["abs_pos"] = trades["position_dollars"].abs()
tier_counts = trades["abs_pos"].value_counts().sort_index()
print("\nTrades by size:")
for size, count in tier_counts.items():
print(f" ${int(size):,}: {count} trades")
# 7) Returns as % of position
trades["ret_pct_of_pos"] = trades["pnl_dollars"] / trades["position_dollars"].abs()
avg_ret_pct = trades["ret_pct_of_pos"].mean()
med_ret_pct = trades["ret_pct_of_pos"].median()
std_ret_pct = trades["ret_pct_of_pos"].std(ddof=1)
worst_ret_pct = trades["ret_pct_of_pos"].min()
best_ret_pct = trades["ret_pct_of_pos"].max()
print("\nReturn per trade as % of position:")
print(f"Average: {100 * avg_ret_pct:.2f}%")
print(f"Median: {100 * med_ret_pct:.2f}%")
print(f"Std dev: {100 * std_ret_pct:.2f}%")
print(f"Worst: {100 * worst_ret_pct:.2f}%")
print(f"Best: {100 * best_ret_pct:.2f}%")
print(f"\nHit rate: {hit_rate*100:.1f}% of trades are profitable")
# 8) Portfolio view on a 10m book
total_pnl_pct_of_book = total_pnl / ASSUMED_CAPITAL
start_date = trades["day0"].min()
end_date = trades["exit_date"].max() if trades["exit_date"].notna().any() else trades["day0"].max()
years = (end_date - start_date).days / 365.25 if pd.notna(end_date) else 0.0
trades_per_year = n_trades / years if years > 0 else np.nan
equity_final = 1.0 + total_pnl_pct_of_book
ann_return = equity_final ** (1.0 / years) - 1.0 if years > 0 else np.nan
print(f"\nPortfolio view assuming ${ASSUMED_CAPITAL:,} allocated to this strategy:")
print(f"Total PnL as % of book: {100 * total_pnl_pct_of_book:.2f}%")
print(f"Trading period: {start_date.date()} to {end_date.date()} (~{years:.2f} years)")
print(f"Trades per year: {trades_per_year:.2f}")
print(f"Approx annualised return on the book: {ann_return*100:.2f}%")
# 9) Equity curve + simple annualised Sharpe on book
equity = 1.0
equity_curve = []
for _, tr in trades.iterrows():
equity *= (1.0 + tr["pnl_dollars"] / ASSUMED_CAPITAL)
equity_curve.append(equity)
equity_series = pd.Series(equity_curve, index=trades["exit_date"].reset_index(drop=True))
peak = equity_series.cummax()
drawdowns = equity_series / peak - 1.0
max_drawdown = drawdowns.min()
# Approx annual volatility from per-trade pnl on book
per_trade_ret_on_book = trades["pnl_dollars"] / ASSUMED_CAPITAL
std_per_trade = per_trade_ret_on_book.std(ddof=1)
ann_vol = std_per_trade * np.sqrt(trades_per_year) if trades_per_year > 0 else np.nan
sharpe_ann = (ann_return / ann_vol) if (years > 0 and np.isfinite(ann_vol) and ann_vol > 0) else np.nan
print("\nEquity curve on 10m book:")
print(f"Final capital (starting from 1.0): {equity_series.iloc[-1]:.6f}")
print(f"Maximum drawdown: {max_drawdown:.6f}")
print(f"Approx annual volatility: {ann_vol:.6f}")
print(f"Approx annual Sharpe: {sharpe_ann:.3f}")
# 10) Per-tier average % return
print("\nPer-tier average return (as % of position):")
for size, count in tier_counts.items():
sub = trades[trades["abs_pos"] == size]
avg_pct_tier = (sub["pnl_dollars"] / sub["position_dollars"].abs()).mean()
print(f" Size ${int(size):,}: n={count}, avg = {100 * avg_pct_tier:.2f}%")
return bt
if __name__ == "__main__":
backtest_directional()
Total events in sample (day0 >= 2010-01-01): 189
Loaded 940 unique macro dates to avoid (day0 only).
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11640\3952785008.py:103: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  macro["date"] = pd.to_datetime(macro["date"]).dt.normalize()
Saved trade details to backtest_directional_trades.csv

Number of trades: 13
Total PnL: $258,254.62
Average PnL per trade: $19,865.74
Median PnL per trade: $14,111.36
Std dev PnL per trade: $16,718.96
Hit rate: 0.846
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 8 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 6.06%
Median: 5.98%
Std dev: 4.68%
Worst: -0.75%
Best: 15.40%

Hit rate: 84.6% of trades are profitable

Portfolio view assuming $10,000,000 allocated to this strategy:
Total PnL as % of book: 2.58%
Trading period: 2016-11-11 to 2024-02-29 (~7.30 years)
Trades per year: 1.78
Approx annualised return on the book: 0.35%

Equity curve on 10m book:
Final capital (starting from 1.0): 1.026118
Maximum drawdown: -0.000150
Approx annual volatility: 0.002231
Approx annual Sharpe: 1.568

Per-tier average return (as % of position):
  Size $200,000: n=8, avg = 6.26%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%
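The score the backtest trades on is the regression's predicted CAR, centred by the training-sample mean CAR and scaled by the residual standard deviation, so it behaves like a z-score for "how unusual is this predicted drift". A minimal sketch of that calculation on synthetic events (feature names and numbers are illustrative only, not the pipeline's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training sample: two features per past event plus its realised CAR(0,5)
n = 50
X_train = np.column_stack([
    rng.normal(0.0, 10.0, n),    # stand-in for eps_surprise_pct
    rng.normal(0.0, 0.02, n),    # stand-in for pre_ret_3d
])
y_train = 0.01 + 0.002 * X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(0.0, 0.02, n)

# OLS with intercept, exactly as the walk-forward loop does
X_mat = np.column_stack([np.ones(n), X_train])
beta_hat, *_ = np.linalg.lstsq(X_mat, y_train, rcond=None)

# Score = (predicted CAR - training-mean CAR) / residual std
resid = y_train - X_mat @ beta_hat
sigma_resid = resid.std(ddof=1)
mean_car = y_train.mean()

x_new = np.array([15.0, 0.02])   # hypothetical new event's features
car_hat = beta_hat[0] + beta_hat[1:] @ x_new
score = (car_hat - mean_car) / sigma_resid if sigma_resid > 0 else 0.0
```

A strongly positive surprise here produces a score well above the 0.3/0.4/0.5 ladder thresholds, which is what pushes the position size up the tiers.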
In [3]:
import math
import numpy as np
import pandas as pd
# ================== SETTINGS ================== #
EVENT_STUDY_PATH = "event_study_car_0_5.csv"
FEATURES_PATH = "features_model.csv"
MACRO_CALENDAR_PATH = "macro_calendar.csv"
PRICE_FILES = {
"AAPL": "AAPL.csv",
"NVDA": "NVDA.csv",
"GOOGL": "GOOGL.csv",
}
ASSUMED_CAPITAL = 10_000_000
COST_RATE = 0.0005
FEATURE_COLS = ["eps_surprise_pct", "pre_ret_3d"]
TRAIN_START_DATE = pd.Timestamp("2010-01-01")
MIN_TRAIN_EVENTS = 80
TUNING_END_DATE = pd.Timestamp("2018-12-31")
FORWARD_START_DATE = pd.Timestamp("2019-01-01")
# =================================================
# Macro calendar loader
# =================================================
def load_macro_dates(path: str = MACRO_CALENDAR_PATH) -> set:
try:
macro = pd.read_csv(path)
except FileNotFoundError:
print("Macro calendar file not found – no macro filter applied.")
return set()
if "date" not in macro.columns:
print("Macro calendar has no 'date' column – no macro filter applied.")
return set()
# Your macro calendar is dd/mm/YYYY
macro["date"] = pd.to_datetime(macro["date"], dayfirst=True, errors="coerce").dt.normalize()
dates = set(macro["date"].dropna().unique())
print(f"Loaded {len(dates)} unique macro dates to avoid (day0 only).")
return dates
# =================================================
# Price loader – tailored to your pipeline files
# =================================================
def load_prices() -> pd.DataFrame:
"""
Load AAPL/NVDA/GOOGL from separate CSVs and standardise to:
index = [ticker, date]
cols = open, high, low, close, adj_close, volume
"""
all_frames = []
for sym, path in PRICE_FILES.items():
df = pd.read_csv(path)
# --- date handling ---
if "date" in df.columns:
df["date"] = pd.to_datetime(df["date"]).dt.normalize()
elif "Date" in df.columns:
df.rename(columns={"Date": "date"}, inplace=True)
df["date"] = pd.to_datetime(df["date"]).dt.normalize()
else:
# assume first column is date
first = df.columns[0]
df.rename(columns={first: "date"}, inplace=True)
df["date"] = pd.to_datetime(df["date"]).dt.normalize()
# --- add ticker if missing ---
if "ticker" not in df.columns:
df["ticker"] = sym
else:
# normalise ticker in case it's mixed case
df["ticker"] = df["ticker"].fillna(sym).astype(str)
# --- unify column names ---
def rename_if_exists(old, new):
if old in df.columns:
df.rename(columns={old: new}, inplace=True)
# yfinance-style to lower snake
rename_if_exists("Open", "open")
rename_if_exists("High", "high")
rename_if_exists("Low", "low")
rename_if_exists("Close", "close")
rename_if_exists("Adj Close", "adj_close")
rename_if_exists("Adj close", "adj_close")
rename_if_exists("Adj_Close", "adj_close")
# ensure essential columns exist; if open/high/low missing, copy close
if "close" not in df.columns:
raise ValueError(f"{sym}: no 'close' column found in {path}")
if "open" not in df.columns:
df["open"] = df["close"]
if "high" not in df.columns:
df["high"] = df["close"]
if "low" not in df.columns:
df["low"] = df["close"]
if "adj_close" not in df.columns:
df["adj_close"] = df["close"]
if "volume" not in df.columns:
df["volume"] = np.nan
# keep only what we need
df = df[["ticker", "date", "open", "high", "low", "close", "adj_close", "volume"]].copy()
# enforce numeric on price columns
for c in ["open", "high", "low", "close", "adj_close", "volume"]:
df[c] = pd.to_numeric(df[c], errors="coerce")
df = df.sort_values("date").reset_index(drop=True)
all_frames.append(df)
prices = pd.concat(all_frames, ignore_index=True)
prices = prices.sort_values(["ticker", "date"])
prices.set_index(["ticker", "date"], inplace=True)
return prices
# =================================================
# Score -> dollar position
# (your "lowered" thresholds)
# =================================================
def score_to_allocation_dollars(score: float) -> float:
s = abs(score)
if s < 0.30:
alloc = 0.0
elif s < 0.40:
alloc = 200_000.0
elif s < 0.50:
alloc = 400_000.0
else:
alloc = 600_000.0
return float(np.sign(score) * alloc)
# =================================================
# Trade summary helper
# =================================================
def summarise_trades(trades: pd.DataFrame, label: str):
print(f"\n================ {label} ================")
n_trades = len(trades)
print(f"Number of trades: {n_trades}")
if n_trades == 0:
return
total_pnl = trades["pnl_dollars"].sum()
avg_pnl = trades["pnl_dollars"].mean()
med_pnl = trades["pnl_dollars"].median()
std_pnl = trades["pnl_dollars"].std(ddof=1)
hit_rate = (trades["pnl_dollars"] > 0).mean()
worst = trades["pnl_dollars"].min()
best = trades["pnl_dollars"].max()
print(f"Total PnL: ${total_pnl:,.2f}")
print(f"Average PnL per trade: ${avg_pnl:,.2f}")
print(f"Median PnL per trade: ${med_pnl:,.2f}")
print(f"Std dev PnL per trade: ${std_pnl:,.2f}")
print(f"Hit rate: {hit_rate:.3f}")
print(f"Worst trade: ${worst:,.2f}")
print(f"Best trade: ${best:,.2f}")
trades = trades.copy()
trades["abs_pos"] = trades["position_dollars"].abs()
tier_counts = trades["abs_pos"].value_counts().sort_index()
print("\nTrades by size:")
for size, count in tier_counts.items():
print(f" ${int(size):,}: {count} trades")
trades["ret_pct_of_pos"] = trades["pnl_dollars"] / trades["position_dollars"].abs()
avg_ret_pct = trades["ret_pct_of_pos"].mean()
med_ret_pct = trades["ret_pct_of_pos"].median()
std_ret_pct = trades["ret_pct_of_pos"].std(ddof=1)
worst_ret_pct = trades["ret_pct_of_pos"].min()
best_ret_pct = trades["ret_pct_of_pos"].max()
print("\nReturn per trade as % of position:")
print(f"Average: {100 * avg_ret_pct:.2f}%")
print(f"Median: {100 * med_ret_pct:.2f}%")
print(f"Std dev: {100 * std_ret_pct:.2f}%")
print(f"Worst: {100 * worst_ret_pct:.2f}%")
print(f"Best: {100 * best_ret_pct:.2f}%")
print(f"\nHit rate: {hit_rate*100:.1f}% of trades are profitable")
trades = trades.sort_values("day0").reset_index(drop=True)
start_date = trades["day0"].min()
end_date = trades["exit_date"].max() if trades["exit_date"].notna().any() else trades["day0"].max()
if pd.isna(start_date) or pd.isna(end_date):
years = np.nan
else:
years = (end_date - start_date).days / 365.25
total_pnl_pct_of_book = total_pnl / ASSUMED_CAPITAL
equity_final = 1.0 + total_pnl_pct_of_book
if years and years > 0:
ann_return = equity_final ** (1.0 / years) - 1.0
else:
ann_return = np.nan
# Equity curve & drawdown
r = trades["pnl_dollars"] / ASSUMED_CAPITAL
capital = 1.0
peak = 1.0
max_drawdown = 0.0
for rr in r:
capital *= (1.0 + rr)
if capital > peak:
peak = capital
dd = capital / peak - 1.0
if dd < max_drawdown:
max_drawdown = dd
trades_per_year = n_trades / years if years and years > 0 else np.nan
ann_vol = r.std(ddof=1) * math.sqrt(trades_per_year) if trades_per_year and trades_per_year > 0 else np.nan
sharpe = ann_return / ann_vol if ann_vol and ann_vol > 0 else np.nan
print(f"\nPortfolio view on ${ASSUMED_CAPITAL:,}:")
print(f"Total PnL as % of book: {100 * total_pnl_pct_of_book:.2f}%")
if years and years > 0:
print(f"Trading period: {start_date.date()} to {end_date.date()} (~{years:.2f} years)")
print(f"Trades per year: {trades_per_year:.2f}" if pd.notna(trades_per_year) else "Trades per year: n/a")
print(f"Approx annualised return on the book: {ann_return*100:.2f}%")
print(f"Final capital (starting from 1.0): {capital:.6f}")
print(f"Maximum drawdown: {max_drawdown:.6f}")
print(f"Approx annual volatility: {ann_vol:.6f}" if pd.notna(ann_vol) else "Approx annual volatility: n/a")
print(f"Approx annual Sharpe: {sharpe:.3f}" if pd.notna(sharpe) else "Approx annual Sharpe: n/a")
print("\nPer-tier average return (as % of position):")
for size, count in tier_counts.items():
sub = trades[trades["abs_pos"] == size]
avg_pct_tier = (sub["pnl_dollars"] / sub["position_dollars"].abs()).mean()
print(f" Size ${int(size):,}: n={count}, avg = {100 * avg_pct_tier:.2f}%")
# =================================================
# Main backtest
# =================================================
def backtest_directional_split():
# ---- load event study ----
event_df = pd.read_csv(EVENT_STUDY_PATH)
# normalise date cols if present
for col in ["announce_date", "day0", "event_start", "event_end", "est_start", "est_end"]:
if col in event_df.columns:
event_df[col] = pd.to_datetime(event_df[col]).dt.normalize()
# choose CAR column
if "CAR_0_5" in event_df.columns:
car_col = "CAR_0_5"
elif "CAR" in event_df.columns:
car_col = "CAR"
else:
raise ValueError("event_study_car_0_5.csv must have CAR_0_5 or CAR column.")
event_df.rename(columns={car_col: "CAR_USED"}, inplace=True)
# ---- load features ----
feat_df = pd.read_csv(FEATURES_PATH)
for col in ["announce_date", "day0"]:
if col in feat_df.columns:
feat_df[col] = pd.to_datetime(feat_df[col]).dt.normalize()
merge_keys = ["ticker", "announce_date", "day0"]
df = pd.merge(event_df, feat_df, on=merge_keys, how="inner")
df = df[df["day0"] >= TRAIN_START_DATE].copy()
df = df.sort_values("day0").reset_index(drop=True)
print(f"Total events in sample (day0 >= {TRAIN_START_DATE.date()}): {len(df)}")
macro_dates = load_macro_dates(MACRO_CALENDAR_PATH)
prices = load_prices()
records = []
for i in range(len(df)):
row = df.iloc[i]
ticker = row["ticker"]
day0 = pd.to_datetime(row["day0"]).normalize()
announce_date = row["announce_date"]
car_0_5 = float(row["CAR_USED"])
x_i = row[FEATURE_COLS].values.astype(float)
score = np.nan
pos_dollars = 0.0
pnl_dollars = 0.0
raw_ret_0_5 = np.nan
exit_date = pd.NaT
# training = all past events
train = df.iloc[:i].dropna(subset=FEATURE_COLS + ["CAR_USED"])
if len(train) >= MIN_TRAIN_EVENTS:
X_train = train[FEATURE_COLS].values.astype(float)
y_train = train["CAR_USED"].values.astype(float)
X_mat = np.column_stack([np.ones(len(X_train)), X_train])
beta_hat, *_ = np.linalg.lstsq(X_mat, y_train, rcond=None)
intercept = beta_hat[0]
coef = beta_hat[1:]
y_hat_train = X_mat @ beta_hat
resid = y_train - y_hat_train
sigma_resid = resid.std(ddof=1)
mean_car = y_train.mean()
car_hat = intercept + np.dot(coef, x_i)
car_feat = car_hat - mean_car
score = car_feat / sigma_resid if sigma_resid > 0 else 0.0
# macro filter: skip if day0 is macro date
if day0 not in macro_dates:
pos_dollars = score_to_allocation_dollars(score)
else:
pos_dollars = 0.0
if pos_dollars != 0.0:
try:
px_tkr = prices.loc[ticker] # index = date
row0 = px_tkr.loc[day0]
open0 = float(row0["open"])
px_window = px_tkr.loc[day0:]
px_window = px_window.iloc[:6] # up to day0..day5
row_exit = px_window.iloc[-1]
exit_date = pd.to_datetime(row_exit.name).normalize()
close_exit = float(row_exit["adj_close"])
raw_ret_0_5 = (close_exit - open0) / open0
trade_cost = 2.0 * COST_RATE * abs(pos_dollars)
pnl_dollars = pos_dollars * raw_ret_0_5 - trade_cost
except KeyError:
pos_dollars = 0.0
pnl_dollars = 0.0
raw_ret_0_5 = np.nan
exit_date = pd.NaT
records.append({
"ticker": ticker,
"announce_date": announce_date,
"day0": day0,
"exit_date": exit_date,
"CAR_0_5": car_0_5,
"score": score,
"position_dollars": pos_dollars,
"raw_ret_0_5": raw_ret_0_5,
"pnl_dollars": pnl_dollars,
})
bt = pd.DataFrame(records)
bt.to_csv("backtest_directional_split_all_events.csv", index=False)
trades = bt[bt["position_dollars"] != 0].copy()
trades = trades.sort_values("day0").reset_index(drop=True)
trades.to_csv("backtest_directional_split_trades.csv", index=False)
print("\nSaved trade details to backtest_directional_split_trades.csv")
# full period
summarise_trades(trades, "FULL PERIOD (2010+)")
# tuning
tuning_trades = trades[trades["day0"] <= TUNING_END_DATE].copy()
summarise_trades(tuning_trades, "TUNING PERIOD (2010–2018)")
# forward
fwd_trades = trades[trades["day0"] >= FORWARD_START_DATE].copy()
summarise_trades(fwd_trades, "FORWARD TEST (2019–2024+)")
return bt, trades
if __name__ == "__main__":
backtest_directional_split()
Total events in sample (day0 >= 2010-01-01): 189
Loaded 940 unique macro dates to avoid (day0 only).
Saved trade details to backtest_directional_split_trades.csv

================ FULL PERIOD (2010+) ================
Number of trades: 13
Total PnL: $258,254.62
Average PnL per trade: $19,865.74
Median PnL per trade: $14,111.36
Std dev PnL per trade: $16,718.96
Hit rate: 0.846
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 8 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 6.06%
Median: 5.98%
Std dev: 4.68%
Worst: -0.75%
Best: 15.40%

Hit rate: 84.6% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 2.58%
Trading period: 2016-11-11 to 2024-02-29 (~7.30 years)
Trades per year: 1.78
Approx annualised return on the book: 0.35%
Final capital (starting from 1.0): 1.026118
Maximum drawdown: -0.000150
Approx annual volatility: 0.002231
Approx annual Sharpe: 1.568

Per-tier average return (as % of position):
  Size $200,000: n=8, avg = 6.26%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%

================ TUNING PERIOD (2010–2018) ================
Number of trades: 2
Total PnL: $44,913.63
Average PnL per trade: $22,456.82
Median PnL per trade: $22,456.82
Std dev PnL per trade: $11,802.25
Hit rate: 1.000
Worst trade: $14,111.36
Best trade: $30,802.27

Trades by size:
  $200,000: 2 trades

Return per trade as % of position:
Average: 11.23%
Median: 11.23%
Std dev: 5.90%
Worst: 7.06%
Best: 15.40%

Hit rate: 100.0% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 0.45%
Trading period: 2016-11-11 to 2018-11-26 (~2.04 years)
Trades per year: 0.98
Approx annualised return on the book: 0.22%
Final capital (starting from 1.0): 1.004496
Maximum drawdown: 0.000000
Approx annual volatility: 0.001169
Approx annual Sharpe: 1.882

Per-tier average return (as % of position):
  Size $200,000: n=2, avg = 11.23%

================ FORWARD TEST (2019–2024+) ================
Number of trades: 11
Total PnL: $213,340.99
Average PnL per trade: $19,394.64
Median PnL per trade: $13,853.42
Std dev PnL per trade: $17,886.09
Hit rate: 0.818
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 6 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 5.12%
Median: 5.82%
Std dev: 4.06%
Worst: -0.75%
Best: 12.86%

Hit rate: 81.8% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 2.13%
Trading period: 2019-02-15 to 2024-02-29 (~5.04 years)
Trades per year: 2.18
Approx annualised return on the book: 0.42%
Final capital (starting from 1.0): 1.021526
Maximum drawdown: -0.000150
Approx annual volatility: 0.002643
Approx annual Sharpe: 1.589

Per-tier average return (as % of position):
  Size $200,000: n=6, avg = 4.60%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%
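The portfolio-view numbers in summarise_trades come from a handful of standard formulas: compound each trade's PnL on the book, measure drawdown against the running equity peak, annualise the geometric return over the elapsed years, and scale per-trade volatility by the square root of trades per year. A compact sketch with made-up PnL figures (not the backtest's):

```python
import numpy as np

# Toy per-trade PnL on an assumed $10m book over a 4-year span
book = 10_000_000
pnl = np.array([20_000.0, -5_000.0, 35_000.0, 10_000.0])
years = 4.0

# Compound per-trade returns on the book; drawdown is measured vs the running peak
r = pnl / book
equity = np.cumprod(1.0 + r)
max_drawdown = float((equity / np.maximum.accumulate(equity) - 1.0).min())

# Annualise: geometric return from final equity, volatility scaled by trade frequency
trades_per_year = len(pnl) / years
ann_return = equity[-1] ** (1.0 / years) - 1.0
ann_vol = r.std(ddof=1) * np.sqrt(trades_per_year)
sharpe = ann_return / ann_vol
```

One design caveat worth noting: with so few trades per year, sqrt-of-frequency scaling makes both the annual volatility and the Sharpe ratio very sensitive to the trade count, which is why the reported Sharpe figures should be read loosely.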
In [ ]: